The R1 paper used o1-mini and o1-1217 in its comparisons, so I imagine they needed lots of OpenAI compute in December and January to evaluate those benchmarks the same way as the rest of their pipeline. They show that distilling to smaller models works wonders, but you need the thought traces, which o1 does not provide. My best guess is that this type of news is just noise.
[edit: the above comment was based on sensationalist reporting in the original link and not the current FT article. I still think there has been a lot of noise in the news this last week, but it may well be that OpenAI has valid evidence of wrongdoing. I would guess that any such wrongdoing would apply directly to V3 rather than R1-Zero, because o1 does not provide traces, and generating synthetic thinking data with 4o may be counterproductive.]