It's a symptom of asking models for answers that aren't exactly in the training set: the internal interpolation the models do probably hits edge cases where, statistically, it goes down the wrong path.
This is exactly it: it's a result of RLVR (reinforcement learning with verifiable rewards), where we push the model to reason its way to an answer even when that information isn't in its base training.
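For anyone unfamiliar with RLVR, the core idea is that the model only gets rewarded when a checkable final answer comes out right. Here's a minimal sketch of that kind of reward signal, assuming a toy exact-match check (the function and examples are made up for illustration, not any lab's actual implementation):

```python
# Toy sketch of an RLVR-style "verifiable" reward: a binary check on the
# final answer. Note it only scores the answer itself -- abstaining with
# "I don't know" scores no better than a confident wrong guess, which is
# one intuition for why training like this can encourage guessing.

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip().lower() == reference_answer.strip().lower() else 0.0

# Hypothetical rollouts for a question the base model doesn't actually know:
print(verifiable_reward("Paris", "Paris"))         # 1.0 -- a lucky guess is rewarded
print(verifiable_reward("Lyon", "Paris"))          # 0.0
print(verifiable_reward("I don't know", "Paris"))  # 0.0 -- same penalty as being wrong
```

Under that kind of signal, a policy that always guesses something plausible beats one that admits uncertainty, at least on the training distribution.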