> 4.5/o3 doesn't seem hugely more intelligent than 3.0
I disagree where 3.0 is concerned, though that may feel true for 4.0, or even 3.5, on some queries.
The reason is that when LLMs are asked questions whose answers can be interpolated or retrieved from their training data, they will likely compose their responses from widely accepted human knowledge and patterns. (This is a simplification of how LLMs work, just to illustrate the key point.) That knowledge has been refined over decades of human experiments and experience.
Domain experts of varying intelligence will likely give similar answers to these largely routine questions as well.
The difference shows up when you pose a query that demands deep reasoning or requires expertise in multiple fields. Then, frontier reasoning models like o3 can sometimes form creative solutions that are not textbook answers.
I strongly suspect that Reinforcement Learning with feedback from high-quality simulations or real environments will be key for these models' capabilities to surpass those of human experts.
Superhuman milestones, comparable to those achieved by AlphaGo and AlphaZero between 2016 and 2018, might be reached in several fields over the coming years. This will likely happen first in fields with rapid feedback loops and highly accurate simulators, e.g., math problem solving (as opposed to novel mathematical research) and coding (as opposed to product innovation).
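To make the "rapid feedback loop" point concrete, here is a deliberately toy sketch of the idea: a policy samples answers, an exact automatic verifier scores them, and the verified answers are reinforced. The problem set, the lookup-table "policy," and all function names here are invented for illustration; a real RL-from-verifier setup would use an LLM as the policy and a math checker or code-execution sandbox as the reward signal.

```python
import random

def make_problem(rng):
    """Toy 'environment': single-digit addition with an exact verifier."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return (a, b), a + b  # (question, ground-truth answer)

def train(steps=2000, seed=0):
    """Sharpen a guess distribution using only verifier feedback."""
    rng = random.Random(seed)
    policy = {}  # problem -> list of candidate answers, sampled uniformly
    for _ in range(steps):
        problem, truth = make_problem(rng)
        candidates = policy.setdefault(problem, list(range(2, 19)))
        answer = rng.choice(candidates)
        if answer == truth:  # fast, exact reward: 1 if verified, else 0
            # Reinforce: collapse the distribution onto the verified answer.
            policy[problem] = [answer]
    return policy

def accuracy(policy, seed=1, trials=500):
    """Fraction of held-out trials answered correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        problem, truth = make_problem(rng)
        candidates = policy.get(problem, list(range(2, 19)))
        correct += rng.choice(candidates) == truth
    return correct / trials
```

The point of the sketch is the structure, not the algorithm: because the verifier is exact and cheap, the loop can run millions of times, which is precisely the regime where AlphaZero-style self-improvement worked and where math and coding differ from open-ended research.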