The output, from what I've seen, was okay? I don't know if it's that much better, and I think LLMs benefit a lot from there not being an actual, objective measure by which you can compare two different models.
Sure, there are some coding competitions, there are some benchmarks, but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
Is there any reasonable way to compare outputs of fuzzy algorithms anyways? It is still an algorithm under the hood, with defined inputs, calculations and outputs, right? (just with a little bit of randomness defined by a random seed)
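The seed point is easy to demonstrate with any sampler, for what it's worth: fix the seed and the "fuzzy" output is fully reproducible. A toy sketch (plain softmax sampling over made-up logits, not a real LLM):

```python
import math
import random

def sample_next_token(logits, rng, temperature=1.0):
    """Softmax-sample an index from raw logits, like one LLM decoding step."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]

# Same seed -> identical "random" tokens every run: still a function.
for _ in range(2):
    rng = random.Random(42)
    print([sample_next_token(logits, rng) for _ in range(5)])
```

Same seed, same output. The hard-to-compare feel of hosted models comes more from the seed not being exposed (plus serving-side effects) than from the algorithm itself.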
I have a dozen or so very random prompts I feed into every new model that are based on things I’m very knowledgeable and passionate about, and compare the outputs. A couple are directly coding related, a couple are “write a few paragraphs explaining <technical thing>”, and the rest are purely about non-computer hobbies, etc.
I’ve found it way more useful for me personally than any of the “formal” tests, as I don’t really care how it scores on random tests but instead very much do care how well it does my day to day things.
It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
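The "harness" for this is nothing fancy, by the way; roughly this shape, with ask() as a stand-in for whichever API client you use:

```python
import json
from pathlib import Path

# Redacted examples of my prompt suite; yours would differ.
PROMPTS = {
    "code_review": "Review this function for bugs: ...",
    "explain_tech": "Write a few paragraphs explaining <technical thing>.",
    "hobby_check": "A question only someone deep into the hobby gets right.",
}

def ask(model: str, prompt: str) -> str:
    """Stand-in for a real API call; swap in your client of choice."""
    return f"[{model} response to: {prompt[:40]}...]"

def run_suite(models: list[str], out_dir: str = "evals") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    for name, prompt in PROMPTS.items():
        # One file per prompt, all models side by side for eyeballing.
        answers = {m: ask(m, prompt) for m in models}
        Path(out_dir, f"{name}.json").write_text(json.dumps(answers, indent=2))

run_suite(["new_model", "old_model"])
```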
> It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
I coined "Murray Gell-Mann" for this sort of test of AI.
You ask people to rate them, possibly along multiple dimensions. People are much better at resolving comparisons than making absolute assessments. https://lmarena.ai/
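Under the hood that's pairwise preference votes turned into a ranking; lmarena uses Elo/Bradley-Terry-style scores. A minimal Elo update sketch (the K-factor, initial ratings, and votes here are arbitrary illustrations):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Standard Elo update from one pairwise vote."""
    e = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e)
    ratings[loser] -= k * (1.0 - e)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner in ["model_a", "model_a", "model_b"]:  # hypothetical votes
    loser = "model_b" if winner == "model_a" else "model_a"
    update(ratings, winner, loser)
print(ratings)  # model_a ends up slightly ahead
```

No single vote proves anything, but across thousands of voters the comparison signal is a lot more stable than asking anyone to score a banana bread recipe on a 1-10 scale.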
> but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
yes? I mean, if you were really doing this, you could make both and see how they turned out. Or, if you were familiar with doing this and were just looking for a quick refresher, you'd know if something was off or not.
but just like everything else on the interweb, if you have no knowledge except for whatever your search result presented, you're screwed!