The big deal here isn't that R1 makes other models obsolete in terms of performance, but how cheap it is: $2 per million output tokens vs $60 for O1, which it matches in benchmark performance.
O1 vs R1 performance on specific non-benchmark problems is also not that relevant until people have replicated R1 and/or tried fine-tuning it with additional data. What would be interesting to see is whether (given their different use of RL) there is any difference in how well R1 and O1 generalize their reasoning to domains they were not specifically trained on. I'd expect that neither does that well, but not knowing the details of what they were trained on makes it hard to test.