Ask HN: How do you personally evaluate LLMs?

I’ve seen the standard evals and benchmarks for new LLMs, but they don’t really capture how I actually use them. My own test is pretty specific: whenever a new LLM drops, I ask it to “Write an advanced three.js music visualizer.” Then I compare the result to what older models produced for the same prompt by checking:

1. Does it use a recent version of three.js?

2. Does the generated code run out of the box?

3. How complex/innovative is the visualizer?
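
For what it's worth, the first check can be partly automated, and the three results can be logged per model. Below is a rough Python sketch, not a real harness. It assumes the generated code loads three.js from a CDN URL that embeds the version, either npm-style (`three@0.160.0`, where the middle number is the release) or tag-style (`three.js/r128`). The names `threejs_release`, `Score`, and `rank_key` are made up for illustration, and the ordering in `rank_key` is just one arbitrary way to weigh the three checks.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed URL shapes only; code that bundles three.js locally will not match.
NPM_STYLE = re.compile(r"three@0\.(\d+)")     # e.g. three@0.160.0 -> 160
TAG_STYLE = re.compile(r"three\.js/r(\d+)")   # e.g. three.js/r128 -> 128


def threejs_release(code: str) -> Optional[int]:
    """Return the three.js release number found in the code, or None."""
    for pattern in (NPM_STYLE, TAG_STYLE):
        match = pattern.search(code)
        if match:
            return int(match.group(1))
    return None


@dataclass
class Score:
    model: str
    release: Optional[int]  # check 1: three.js release, if detectable
    runs: bool              # check 2: ran out of the box (judged by hand)
    creativity: int         # check 3: subjective rating, e.g. 1 to 5


def rank_key(score: Score) -> tuple:
    """Hypothetical ordering: running code first, then release, then creativity."""
    release = score.release if score.release is not None else -1
    return (score.runs, release, score.creativity)
```

Usage would look something like `max(scores, key=rank_key)` over a list of `Score` entries, one per model. Checks 2 and 3 still need a human to open the page and look at it.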

I’m really curious to hear about other people’s “real-world” benchmarks. What’s your personal test prompt or scenario that reveals whether a new LLM is actually useful for you? How do you decide if it’s truly better than the last version?