Hacker News

I've really doubted that LLM benchmarks reflect real-world user experience ever since they claimed GPT-4o hallucinated less than the original GPT-4.


I don't have an accurate benchmark, but in my personal experience, GPT-4o hallucinates substantially less than GPT-4. We solved a ton of hallucination issues just by upgrading to it...


How much did you use the original GPT-4-0314?

(And even that was a downgrade compared to the more uncensored pre-release versions, which were comparable to GPT-4.5, at least judging by the unicorn test)


I don't recall the original version we used unfortunately :(

in our case, the bump was actually from gpt-4-vision to gpt-4o (the use case required image interpretation)

It got measurably better at both image cases and text-only cases


I'm beginning to believe LLM benchmarks are like European car mileage specs. They say it's 4 liters / 100 km, but everyone knows it's at least 30% off (same with WLTP for EVs).


Those numbers are not off; they're measured on test tracks.

You need to remove your shoe and drive with like two toes to get the speed just right, though.

The test drivers I have done this with take off their shoes or wear ballet shoes.


Cruise control?


No, you want to control the shape of the speed curve while following the speed profile, so you don't overshoot or accelerate too much.

And holding a steady-state speed is not that hard.


Hrm, it is a bit funny that modern cars are drive-by-wire (at least for the throttle), yet they still require a skilled driver to follow a speed profile during testing, when in theory the same thing could be done more precisely by a device plugged into the OBD-II port.
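For what it's worth, the "follow a speed profile without overshooting" part is a classic feedback-control problem. Here's a minimal sketch using a PID controller against a toy longitudinal vehicle model; the model, the gains, and the throttle interface are all hypothetical stand-ins (a real rig would read speed off the CAN bus and actuate the actual throttle), so treat this as an illustration of the idea, not a test procedure.

```python
# Sketch: tracking a target speed profile with a PID controller.
# The vehicle dynamics below (3.0 * throttle accel, linear drag) are
# a made-up toy model; the gains are picked to look plausible, not tuned.

def follow_profile(profile, dt=0.1, kp=0.8, ki=0.4, kd=0.05):
    """Track a list of target speeds (m/s); returns the actual speeds."""
    speed = profile[0]        # start already at the first target speed
    integral = 0.0
    prev_err = 0.0
    actual = []
    for target in profile:
        err = target - speed
        integral += err * dt
        deriv = (err - prev_err) / dt
        prev_err = err
        # PID output, clamped to a normalized throttle/brake range
        throttle = kp * err + ki * integral + kd * deriv
        throttle = max(-1.0, min(1.0, throttle))
        # toy longitudinal model: throttle accel minus speed-proportional drag
        accel = 3.0 * throttle - 0.05 * speed
        speed += accel * dt
        actual.append(speed)
    return actual

# Example profile: hold 10 m/s for 10 s, then ramp toward 15 m/s.
profile = [10.0] * 100 + [10.0 + 0.05 * i for i in range(100)]
actual = follow_profile(profile)
```

The integral term is what compensates for steady drag (the "hold exactly this speed" part a human does with two toes); the derivative term damps overshoot when the profile changes slope.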



