I don't think that's correct. As I understand it, O1 can use reinforcement learning to optimize its chain of thought, but each individual "thought" is still subject to the performance of the LLM.
Therefore, while it can generate new strategies for approaching a problem, the execution of each step within a strategy is still limited by the LLM's probabilistic token generation.
Contrast that with AlphaZero, which can (in theory) come up with entirely novel strategies, since it isn't constrained by human training data.
I think O1 is a step forward, but not a massive leap in technology.