These comments:
> Now do the same for other evaluations, remove the o family, nudge the time scale a bit, and watch the same curve pop out.
> This is called eval saturation, not tech singularity. ARC-2 is already in production btw.
A reply:
> You act like that isn't significant; people just hand-wave "eval saturation."
> The fact that we keep having to make new benchmarks because AI keeps beating the ones we have is extremely significant.
Agree with this. The pace has been mind-boggling.
The pace has been impressive, but until hallucinations are addressed, the more faith and capital you put behind an AI agent, the more you risk losing.