The size reduction while keeping the model coherent is incredible. But I'm skeptical of how much effectiveness was retained. Flappy bird is well known and the kind of thing a non-reasoning model could het right. A better test would be something off the beaten path that R1 and o1 get right that other models don't.