Why are we treating LLM evaluation like a vibe check rather than an engineering problem?
Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning?
I don't think it's just an engineering problem - decades of research have failed to produce a convincing, general definition of intelligence, capability or agency. You can try to form proxy metrics by combining benchmarks, but existing benchmarks are flawed, and should be taken with a pinch of salt.
It's evident in the fact that every time AI has historically met certain thresholds (chess-playing, the Turing Test, fluent language), we play with them a little more and find out there's something still lacking.
Whenever somebody makes a benchmark, people complain that the benchmark results are meaningless because they’re gamed. I don’t know why those same people don’t understand that grading on vibes is strictly worse.
There’s a Dark Forest problem for evals: as soon as they’re made public, the clock starts running on their usefulness. It’s also not clear how to predict how a model will perform on a task based on an eval, or even whether a model that does well on two skills individually will do well on their composition. At this point it might be better to be scientific about unscientific approaches than to attribute more predictive power to evals than they actually have.
However, this doesn't mean we should completely give up on benchmarking. In fact, as models get more intelligent, and we give them more autonomy, I believe that tracking agent alignment to your coding standards becomes even more important.
What I've been exploring is a benchmark that is unique per repo, answering the question of how the coding agent performs in my repo, doing my tasks, with my context. We no longer have to trust general benchmarks.
Of course there will still be difficulties and limitations, but it's a step towards giving devs more information about agent performance, and letting them use that information to tweak and optimize the agent further.
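To make the per-repo idea concrete, here is a minimal sketch of what such a harness could look like. All names (`EvalCase`, `run_agent`, `score`) are hypothetical, and the agent call is stubbed out; a real harness would invoke an actual coding agent and draw its cases from the repo's own task history.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One repo-specific task paired with a programmatic grader."""
    prompt: str
    check: Callable[[str], bool]  # returns True if the agent's output passes

def run_agent(prompt: str) -> str:
    # Stand-in for a real coding-agent call; always returns the same snippet.
    return "def add(a, b):\n    return a + b"

def score(cases: list[EvalCase]) -> float:
    # Fraction of cases whose check passes on the agent's output.
    passed = sum(case.check(run_agent(case.prompt)) for case in cases)
    return passed / len(cases)

cases = [
    EvalCase(
        prompt="Write add(a, b) following our style guide",
        check=lambda out: "return a + b" in out,
    ),
]
print(score(cases))  # 1.0 for the stub above
```

The checks here are simple substring matches, but in practice they would be the repo's own test suite, linters, or review rubrics, which is what makes the benchmark specific to one codebase.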
I meant in the sense of: you have benchmarkers and trainers. If you publicize your evaluation, trainers may well have their models 'consume' it, even if only indirectly: another person creating their own benchmark from scratch may be influenced by yours, even if the new question sets are clean-room. That, and the rule of thumb that benchmark value dissipates like sqrt(age) [0]
So there is a definite advantage to never publicizing your internal benchmark. But then, no one else can replicate your findings. You should assume that the space of benchmarks that are actually decent at evaluating model performance is much larger than what's public, that most of the good ones, the ones that were costliest to produce, are hidden, and that they might not even correspond very well with the public ones. And that the public expensive benchmarks are selective and have a bias towards marketing purposes.
For me, the GitHub CLI is the prime example of this. It is incredibly powerful when combined with regular command line tools. Agents know how to use head, tail, jq and so on to extract only the parts they need.
The best selling point of CLIs is the ability to chain, transform and combine. MCP cannot do this.
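As a sketch of the kind of chaining meant here: the GitHub CLI emits JSON that ordinary pipes can narrow down. The `gh` invocation is shown as a comment because it needs an authenticated session; the runnable line below uses `printf` as a stand-in JSON producer.

```shell
# With an authenticated GitHub CLI, an agent can compose, e.g.:
#   gh pr list --json number,title --jq '.[].title' | head -5
#
# The same chaining works with any command that emits line-oriented
# output; here a stand-in producer, sliced with head and tail:
printf '{"n":1}\n{"n":2}\n{"n":3}\n' | head -2 | tail -1
# prints {"n":2}
```

Each stage only sees text on stdin, which is exactly what makes these tools composable in a way a fixed tool-call schema is not.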
I agree. If you strictly follow the syntax of "Example 1" in JavaScript (calling and awaiting on the same line), the observable output is identical to Python.
I suppose the author meant to say that if you first called your async function and only later did `await`, you would see different behavior.
In every language with any kind of asynchronous features you should get exactly the same result. Other comments have already shown how the example should look and where it differs.
In short: with another coroutine running and awaiting on e.g. sleep(), you can get anything between "parent before" and "child start". In Python that's impossible here, because the child is not run as a new task.
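The distinction being discussed can be shown with a small asyncio sketch (function names are mine). Awaiting the child directly runs it to completion inline; wrapping it in a task lets the parent's output interleave with the child's.

```python
import asyncio

log = []

async def child():
    log.append("child start")
    await asyncio.sleep(0)  # yield to the event loop once
    log.append("child end")

async def main_inline():
    log.append("parent before")
    await child()  # child runs to completion right here
    log.append("parent after")

async def main_task():
    log.append("parent before")
    task = asyncio.create_task(child())  # scheduled, not yet running
    await asyncio.sleep(0)               # yield so the child can start
    log.append("parent after")
    await task

asyncio.run(main_inline())
# order: parent before, child start, child end, parent after
log.append("---")
asyncio.run(main_task())
# order: parent before, child start, parent after, child end
```

The second ordering is the "anything in between" case: once the child is a separate task, where its lines land depends on when each coroutine yields.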
I've been thinking a lot about building an open source dating app as a non-profit offering.
I have a sense that successful dating contributes greatly to overall human happiness. It should be a public service, similar to Wikipedia or libraries.
Free forever, fair and safe, and responsibly managed. It's probably not that expensive to run. But I dunno, I'm kinda frightened to "compete" in this market.
As I understand it, it's not a technical problem, rather a social one first off: you can build it but it'll be "empty" compared to all other options out there, even if it's technically superior to them. Network effect and all that.
There's also a technical problem you'll have to contend with: bots and scammers... so many bots and so many scammers.
I think it's an interesting area, but I've got no time or energy to undertake such an endeavor. However, I'd be happy to talk about it and discuss it further if you'd like to. Contact info is on my profile page here.
I think you should do it. The costs for all these services are still priced like the AOL days where bandwidth mattered. I really don’t think the hosting costs could be much. I had a small dating site decades ago and the cost was almost nothing.
I've watched speed dating events go from free to $45 in the past couple years. Not sure if that's b/c of inevitable factors in running those events or pure opportunism.
I think something like the Matrix protocol would be a good fit here. I would be especially interested in not storing unencrypted user messages.
>While I respect anyone’s decision to spend their days playing pickleball, that life isn’t quite for me—at least not full time. I’m lucky to wake up every day energized to go to work
Bit of an unfair comparison though... Most people don't retire from a job where you're literally handing people money.
That said, I'm a huge fan of Bill's work post-Microsoft :)
I don't need you to repeat the propaganda, I know what the official narrative is. And it's mostly lies, especially the part about Gates suddenly turning from asshole to saint.
Interesting. I would not have thought that a normal discussion with arguments and sources is impossible on HN. But hey, you seem to prefer commenting for the comment's sake rather than for discussion.
First off, I do know his history very well; I'm quite aware that he was not a saint before, but that doesn't change the fact that what he is currently doing is very good.
Do you have anything real? Like real talking points? Real sources? Anything besides just shitting on him?
Ahaha, I've become that person, I guess. I only mentioned Arch because I've always used Ubuntu for Linux desktop VMs, and even test-drove Kubuntu before trying out Cachy. Apart from some brief time getting used to pacman instead of apt as the package manager, I haven't encountered anything else that felt different from Ubuntu.