
Looking at the responses: how the F do people have such wildly different opinions on the relative performance of the same systems?


LLMs: unlimited use cases, each with different performance per model and approach, where high performance on use case A doesn't mean high performance on use case B. And high performance using approach X for use case A doesn't mean high performance using approach Y for that same use case.

The use-case factor is bigger than the approach factor, but both play a role. Most people only use LLMs for a very specific set of tasks with the same approach every time, so they base their view of them solely on how they perform on those tasks.

That explains all of it.


Different prompts/approaches?

I "grew up", as it were, on StackOverflow, when I was in my early dev days and didn't have a clue what I was doing I asked question after question on SO and learned very quickly the difference between asking a good question vs asking a bad one

There is a great Jon Skeet blog post from back in the day called "Writing the perfect question" - https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-...

I think this is as valid as ever in the age of AI: you will get much better output from any of these chatbots if you learn and understand how to ask a good question.


Great point. I'd add that one way to get improved performance is to ask Gemini/ChatGPT to write the prompt for you. For software, have it write a spec. It's easier to tweak something that is already pretty comprehensive.
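For concreteness, here's a minimal sketch of that two-step pattern, assuming the official OpenAI Python client with an API key in the environment; the model name, prompts, and variable names are placeholders, not anything specific from this thread:

    # Step 1: have the model draft a comprehensive spec from a rough ask.
    # Assumes `pip install openai` and OPENAI_API_KEY set; model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()

    spec = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Write a detailed software spec (inputs, outputs, edge cases) "
                       "for a CLI tool that deduplicates lines in large text files.",
        }],
    ).choices[0].message.content

    # ...edit the spec by hand here; tweaking something comprehensive is easier
    # than writing it from scratch...

    # Step 2: use the (edited) spec as the actual implementation prompt.
    code = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Implement this spec:\n\n{spec}"}],
    ).choices[0].message.content

    print(code)

The particular API doesn't matter; the same two calls work against Gemini or any other chat endpoint, and the hand-editing step in the middle is where the leverage is.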


Sure, but if one is bad at asking questions, they would be consistently bad across chatbots.


Yes, but compensating for bad questions is in fact a skill, and in my experience it's one Claude excels at and Gemini does poorly.

In other words, the better you are at prompting (e.g. you write half a page of prompt even for casual uses; believe it or not, such people do exist, and prompt length is in practice a good proxy for prompting skill), the more you will like (or at least get better results with) Gemini over Claude.

This isn't necessarily good for Gemini because being easy to use is actually quite important, but it does mean Gemini is considerably underrated for what it can do.


More likely just different tasks. The frontier is jagged.


It depends wildly (really, that wildly) on what it is exactly that you're doing with them.

One of the biggest problems with practical applications of generative AI right now is that it's basically impossible to tell which models are really good at which things without trying that specific task. There are some generalizations (e.g. you can measure more abstract metrics like capacity for spatial reasoning, and they do affect performance in ways you'd expect), but there's far more uncertainty.

This is also why many people get so pissed when companies retire models. Even if the replacement is seemingly better on the metrics, it's not a given that it's better at your specific thing. Or it may be better, but only if you write a completely different prompt, and, again, the only way to discover that magic correct prompt is through experimentation. Hence it feels less like engineering and more like shamanism a lot of the time.


A) How often people want factual data from LLMs: the more they do it, the more they run into the gibberish generator. B) How much effort it takes to correct LLM output: some people get 80%-ready output, spend some time rewriting it until it's correct, and then tell forums that the LLM practically did most of the work. Other people in the same situation will say they got gibberish and had to spend time rewriting, so LLMs are crap at that task. So we are not only seeing LLM bias, but human reporting bias on top of it.



