Why are we treating LLM evaluation like a vibe check rather than an engineering problem?
Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning?
I don't think it's just an engineering problem - decades of research have failed to produce a convincing, general definition of intelligence, capability or agency. You can try to form proxy metrics by combining benchmarks, but existing benchmarks are flawed, and should be taken with a pinch of salt.
It's evident in the fact that every time AI has historically met certain thresholds (chess-playing, the Turing Test, fluent language), we play with them a little more and find out there's something still lacking.
Whenever somebody makes a benchmark, people complain that the benchmark results are meaningless because they’re gamed. I don’t know why those same people don’t understand that grading on vibes is strictly worse.
There’s a Dark Forest problem for evals: as soon as they’re made public, the clock starts running on their usefulness. It’s also not clear how to predict how a model will perform on a task based on an eval, or even whether a model that does well on two skills individually will do well on their composition. At this point it might be better to be scientific about unscientific approaches than to attribute more predictive power to evals than they actually have.
However, this doesn't mean we should completely give up on benchmarking. In fact, as models get more intelligent, and we give them more autonomy, I believe that tracking agent alignment to your coding standards becomes even more important.
What I've been exploring is a benchmark that is unique per repo, answering the question of how the coding agent performs in my repo, doing my tasks, with my context. We no longer have to trust general benchmarks.
Of course there will still be difficulties and limitations, but it's a step towards giving devs more information about agent performance, and letting them use that information to tweak and optimize the agent further.
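To make the per-repo idea concrete, here is a minimal sketch of what such a harness could look like. All names (`EvalCase`, `run_agent`, `score`) are hypothetical, and the agent call is stubbed out; a real harness would invoke an actual coding agent and draw its cases from the repo's own task history.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One repo-specific task paired with a programmatic grader."""
    prompt: str
    check: Callable[[str], bool]  # returns True if the agent's output passes

def run_agent(prompt: str) -> str:
    # Stand-in for a real coding-agent call; always returns the same snippet.
    return "def add(a, b):\n    return a + b"

def score(cases: list[EvalCase]) -> float:
    # Fraction of cases whose check passes on the agent's output.
    passed = sum(case.check(run_agent(case.prompt)) for case in cases)
    return passed / len(cases)

cases = [
    EvalCase(
        prompt="Write add(a, b) following our style guide",
        check=lambda out: "return a + b" in out,
    ),
]
print(score(cases))  # 1.0 for the stub above
```

The checks here are simple substring matches, but in practice they would be the repo's own test suite, linters, or review rubrics, which is what makes the benchmark specific to one codebase.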
I meant in the sense of: you have benchmarkers and trainers. If you publicize your evaluation, trainers may well have their models 'consume' it, even if only indirectly: another person creating their own benchmark from scratch may be influenced by yours, even if the new question sets are clean-room. That, and the rule of thumb that benchmark value dissipates like sqrt(age) [0]
So there is a definite advantage to never publicizing your internal benchmark. But then, no one else can replicate your findings. You should assume that the space of benchmarks that are actually decent at evaluating model performance is much larger than what's public, that most of the good ones, the ones that were costliest to produce, are hidden, and that they might not even correspond very well with the public ones. And that the public expensive benchmarks are selective and have a bias towards marketing purposes.
For me, the GitHub CLI is the prime example of this. It is incredibly powerful when combined with regular command line tools. Agents know how to use head, tail, jq and so on to extract only the parts they need.
The best selling point of CLIs is the ability to chain, transform and combine. MCP cannot do this.
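As a sketch of the kind of chaining meant here: the GitHub CLI emits JSON that ordinary pipes can narrow down. The `gh` invocation is shown as a comment because it needs an authenticated session; the runnable line below uses `printf` as a stand-in JSON producer.

```shell
# With an authenticated GitHub CLI, an agent can compose, e.g.:
#   gh pr list --json number,title --jq '.[].title' | head -5
#
# The same chaining works with any command that emits line-oriented
# output; here a stand-in producer, sliced with head and tail:
printf '{"n":1}\n{"n":2}\n{"n":3}\n' | head -2 | tail -1
# prints {"n":2}
```

Each stage only sees text on stdin, which is exactly what makes these tools composable in a way a fixed tool-call schema is not.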
I agree. If you strictly follow the syntax of "Example 1" in JavaScript (calling and awaiting on the same line), the observable output is identical to Python.
I suppose the author meant to say that if you first called your async function and only later did `await`, you would see different behavior.
In every language with any kind of asynchronous features you should get exactly the same result. Other comments have already shown how the example should look and where it differs.
In short: with another coroutine running and awaiting on e.g. sleep(), you can get anything between "parent before" and "child start". In Python that's impossible here, because the child is not run as a new task.
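The distinction being discussed can be shown with a small asyncio sketch (function names are mine). Awaiting the child directly runs it to completion inline; wrapping it in a task lets the parent's output interleave with the child's.

```python
import asyncio

log = []

async def child():
    log.append("child start")
    await asyncio.sleep(0)  # yield to the event loop once
    log.append("child end")

async def main_inline():
    log.append("parent before")
    await child()  # child runs to completion right here
    log.append("parent after")

async def main_task():
    log.append("parent before")
    task = asyncio.create_task(child())  # scheduled, not yet running
    await asyncio.sleep(0)               # yield so the child can start
    log.append("parent after")
    await task

asyncio.run(main_inline())
# order: parent before, child start, child end, parent after
log.append("---")
asyncio.run(main_task())
# order: parent before, child start, parent after, child end
```

The second ordering is the "anything in between" case: once the child is a separate task, where its lines land depends on when each coroutine yields.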
I've been thinking a lot about building an open source dating app as a non-profit offering.
I have a sense that successful dating contributes greatly to overall human happiness. It should be a public service, similar to Wikipedia or libraries.
Free forever, fair and safe, and responsibly managed. It's probably not that expensive to run. But I dunno, I'm kinda frightened to "compete" in this market.
As I understand it, it's not a technical problem, rather a social one first off: you can build it but it'll be "empty" compared to all other options out there, even if it's technically superior to them. Network effect and all that.
There's also a technical problem you'll have to contend with: bots and scammers... so many bots and so many scammers.
I think it's an interesting area, but I've got no time or energy to undertake such an endeavor. However, I'd be happy to talk about it and discuss it further if you'd like to. Contact info is on my profile page here.
I think you should do it. The costs for all these services are still priced like the AOL days where bandwidth mattered. I really don’t think the hosting costs could be much. I had a small dating site decades ago and the cost was almost nothing.
I've watched speed dating events go from free to $45 in the past couple years. Not sure if that's b/c of inevitable factors in running those events or pure opportunism.
I think something like the Matrix protocol would be a good fit here. I would be especially interested in not storing unencrypted user messages.
>While I respect anyone’s decision to spend their days playing pickleball, that life isn’t quite for me—at least not full time. I’m lucky to wake up every day energized to go to work
Bit of an unfair comparison though... Most people don't retire from a job where you're literally handing people money.
That said, I'm a huge fan of Bill's work post-Microsoft :)
I don't need you to repeat the propaganda, I know what the official narrative is. And it's mostly lies, especially the part about Gates suddenly turning from asshole to saint.
Interesting. I would not have thought that a normal discussion with arguments and sources is impossible on HN. But hey, you seem to prefer commenting for the comment's sake rather than for discussion.
First off, I do know his history very well; I'm quite aware that he was not a saint before, but that doesn't change the fact that what he is currently doing is very good.
Do you have anything real? Like real talking points? Real sources? Anything besides just shitting on him?
Ahaha, I've become that person, I guess. I only mentioned Arch because I've always used Ubuntu for Linux desktop VMs, and even test-drove Kubuntu before trying out Cachy. Apart from some brief time getting used to pacman instead of apt as the package manager, I haven't encountered anything else that felt different from Ubuntu.