The output, from what I've seen, was okay? I don't know if it's that much better, and I think LLMs benefit a lot from there not being an actual, objective measure by which you can compare two different models.
Sure, there are some coding competitions, there are some benchmarks, but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
Is there any reasonable way to compare outputs of fuzzy algorithms anyways? It is still an algorithm under the hood, with defined inputs, calculations and outputs, right? (just with a little bit of randomness defined by a random seed)
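The seed point is easy to demonstrate with any sampler, for what it's worth: fix the seed and the "fuzzy" output is fully reproducible. A toy sketch (plain softmax sampling over made-up logits, not a real LLM):

```python
import math
import random

def sample_next_token(logits, rng, temperature=1.0):
    """Softmax-sample an index from raw logits, like one LLM decoding step."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]

# Same seed -> identical "random" tokens every run: still a function.
for _ in range(2):
    rng = random.Random(42)
    print([sample_next_token(logits, rng) for _ in range(5)])
```

Same seed, same output. The hard-to-compare feel of hosted models comes more from the seed not being exposed (plus serving-side effects) than from the algorithm itself.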
I have a dozen or so very random prompts I feed into every new model that are based on things I’m very knowledgeable and passionate about, and compare the outputs. A couple are directly coding related, a couple are “write a few paragraphs explaining <technical thing>”, and the rest are purely about non-computer hobbies, etc.
I’ve found it way more useful for me personally than any of the “formal” tests, as I don’t really care how it scores on random tests but instead very much do care how well it does my day to day things.
It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
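The "harness" for this is nothing fancy, by the way; roughly this shape, with ask() as a stand-in for whichever API client you use:

```python
import json
from pathlib import Path

# Redacted examples of my prompt suite; yours would differ.
PROMPTS = {
    "code_review": "Review this function for bugs: ...",
    "explain_tech": "Write a few paragraphs explaining <technical thing>.",
    "hobby_check": "A question only someone deep into the hobby gets right.",
}

def ask(model: str, prompt: str) -> str:
    """Stand-in for a real API call; swap in your client of choice."""
    return f"[{model} response to: {prompt[:40]}...]"

def run_suite(models: list[str], out_dir: str = "evals") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    for name, prompt in PROMPTS.items():
        # One file per prompt, all models side by side for eyeballing.
        answers = {m: ask(m, prompt) for m in models}
        Path(out_dir, f"{name}.json").write_text(json.dumps(answers, indent=2))

run_suite(["new_model", "old_model"])
```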
> It’s like listening to someone in the media talk about a topic you’re passionate about, and you pick up on all the little bits and pieces that aren’t right. It’s a gut feel and very unscientific but it works.
I coined "Murray Gell-Mann" for this sort of test of AI.
You ask people to rate them, possibly along multiple dimensions. People are much better at resolving comparisons than making absolute assessments. https://lmarena.ai/
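Under the hood that's pairwise preference votes turned into a ranking; lmarena uses Elo/Bradley-Terry-style scores. A minimal Elo update sketch (the K-factor, initial ratings, and votes here are arbitrary illustrations):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Standard Elo update from one pairwise vote."""
    e = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e)
    ratings[loser] -= k * (1.0 - e)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner in ["model_a", "model_a", "model_b"]:  # hypothetical votes
    loser = "model_b" if winner == "model_a" else "model_a"
    update(ratings, winner, loser)
print(ratings)  # model_a ends up slightly ahead
```

No single vote proves anything, but across thousands of voters the comparison signal is a lot more stable than asking anyone to score a banana bread recipe on a 1-10 scale.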
> but can you really check if the recipe for banana bread output by Claude is better than the one output by ChatGPT?
yes? I mean, if you were really doing this, you could make both and see how they turned out. Or, if you were familiar with doing this and were just looking for a quick refresher, you'd know if something was off or not.
but just like everything else on the interweb, if you have no knowledge except for whatever your search result presented, you're screwed!