Genuine question: are these companies just including those "obscure" problems in their training data and overfitting on them to pump up their benchmark scores?
o3-pro, GPT-5 Pro, Gemini 2.5 Pro, etc. still can't solve very basic first-principles math problems that rely on raw thinking, no special tricks. Personally, I think that's because those problems aren't in their training data: if I inspect their CoT/reasoning, it's clear to me, at the very least, that they're just running around in circles applying "well-known" techniques and hoping one of them fits (without actually verifying logically that it does). It's a very inhuman reasoning style, and ultimately an incorrect one. It's like someone who was taught a bunch of PhD-level tricks but has the underlying reasoning of a toddler.
I wonder how well their GPT-5 IMO research model would do on some of my benchmark problems.