My initial impression is that this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations).
OpenAI will probably be above 60% within three months, if not immediately, at this $1000-per-question level of compute (which, tbh, is the right way to do it: we should throw compute at problems whenever possible, since that's the main advantage of silicon intelligence).
Their own admission that intelligence is a meaningless metric without a bound on compute is one of the main reasons AI will overtake human intelligence soon: simple scaling is very effective.
I mean, they have a verifier, so couldn't they get to 90% just by having the net generate random candidates and testing each against the verifier until one is numerically correct? I think the final solve rate is less important; the generality of the approach matters more.
But it depends on how many attempts you let it generate. The right comparison is to spend the same test-time RL compute on pure generation and compare success rates (if you generate for long enough, you will eventually hit the answer by chance).
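A minimal sketch of that compute-matched baseline, assuming hypothetical generate(problem) and verify(problem, answer) interfaces (neither is a real API, just stand-ins for "sample a candidate from the model" and "check it numerically"):

```python
import random

def solve_by_rejection(problem, generate, verify, budget):
    """Sample candidates until the verifier accepts one or the budget runs out."""
    for attempt in range(budget):
        candidate = generate(problem)
        if verify(problem, candidate):
            return attempt + 1  # number of samples spent on a solve
    return None  # unsolved within budget

def compare_success_rates(problems, generate, verify, budget):
    """Fraction of problems the pure generate-and-verify baseline solves
    at a fixed per-problem sample budget (the compute-matched comparison)."""
    solved = sum(
        1 for p in problems
        if solve_by_rejection(p, generate, verify, budget) is not None
    )
    return solved / len(problems)

# Toy usage: answers are integers in [0, 999], so the chance-level solve rate
# at a given budget is roughly 1 - (999/1000)**budget (~63% at 1000 guesses).
if __name__ == "__main__":
    toy_problems = [{"answer": random.randrange(1000)} for _ in range(50)]
    gen = lambda p: random.randrange(1000)
    ver = lambda p, a: a == p["answer"]
    for budget in (10, 100, 1000):
        rate = compare_success_rates(toy_problems, gen, ver, budget)
        print(f"budget={budget:>4}: solve rate ~ {rate:.2f}")
```

The point of the toy numbers is just that any success rate quoted for a verifier-in-the-loop system only means something relative to the solve rate a dumb sampler would reach with the same number of verifier calls.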