Any link / source / anything? You've got quite an opportunity here: an OpenAI employee is claiming there's no difference, and you've got something that shows there is.
Codeforces is the same, though they have a footnote saying they're using a different dataset due to saturation. They still have no grounding model to compare with.