> I think with ARC-AGI 2 this was the kind of compute budget they spent to get their models to perform on a human-ish level.

It was ARC-AGI-1 where they used extreme compute budgets to reach human-ish performance. On ARC-AGI-2 they haven't gotten past ~30% correct. Average human performance on ARC-AGI-2 is ~65%, and a human panel gets 100% (because humans understand logical arguments rather than simply exclaiming "you're absolutely right!").