> it has the ability to judge correctness precisely,
That's not possible from a paper.
> it expresses a degree of surprise (low log likelihood?)
I think you're interpreting statistical terms too literally.
The truth of the matter is that we rely on a lot of trust from both reviewers and authors. This isn't a mechanical process. You can't just take metrics at face value[0]. The hard part of peer review is the thing that AI systems are __the worst__ at, and we have absolutely no idea how to resolve that. It is about nuance. Anything short of nuance and we get metric hacking. And boy, if you want to see academic work degrade, make the referee an automated system. No matter how complex that system is, I guarantee you human ingenuity will win and you'll just get metric hacking. We already see this in human-led systems ("peer review" included, and anyone who has ever had a job has experienced it).
I for one don't want to see science led by metric hacking.
Processes will always be noisy, and I'm not suggesting we can get a perfect system. But if we're unwilling to recognize the limitations of our systems and the governing dynamics of the tools we build, then we're doomed to metric hack. It's a tale as old as time (literally). Now, if we create a sentient intelligence, well, that's a completely different ball game, but that's not what you were arguing either.
You need to stop focusing on "making things work" and start making sure they actually work. No measurement is perfectly aligned with one's goals. Anyone in ML who isn't intimately familiar with Goodhart's Law is simply an architect of Goodhart's Hell.
Especially if we are to discuss AGI, because there is no perfect way to measure and there never will be. It is a limitation of physics and mathematics. The story of the Jinni is about precisely this; we've just formalized it.
[0] This is the whole problem with SOTA. Some metrics no longer actually mean anything useful. I'll give an example: look at FID, the main metric for "goodness" of image generation. Its assumptions are poor (the distributions it treats as normal aren't very normal, and it's built on a network trained on ImageNet-1k, which is extremely biased. And no, these problems aren't solved by just switching to CLIP-FID). There have been many papers written on this, and similar ones exist for any given metric.
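To make that assumption concrete, here's a rough sketch of how FID is computed (feature extraction omitted; assume `real_feats` and `gen_feats` are Inception features you've already extracted, one row per image). The whole metric reduces each feature cloud to a mean and a covariance, which is exactly where the Gaussian assumption gets baked in:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """Frechet Inception Distance between two sets of Inception features.

    Each distribution is summarized by only its mean and covariance,
    i.e. it is implicitly modeled as a multivariate Gaussian.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # introduce a small imaginary component, which we discard.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2 * covmean))
```

If the true feature distributions aren't Gaussian (they aren't), two very different feature clouds can still produce similar means and covariances, and the score stops telling you what you think it does.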