
Just speculating, but this looks precisely like the sort of result one might expect from fine-tuning models on chain-of-thought-prompted interactions like those described by Wei et al. in 2022 [1]. Alternatively, as Madaan et al. show in their 2022 paper [2], it may simply be that larger language models have seen more code, and therefore more examples of structured reasoning, in their training data.
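
For anyone who hasn't read [1]: chain-of-thought prompting just means prepending worked examples whose answers spell out the intermediate reasoning steps, so the model imitates that style before committing to a final answer. A minimal sketch in Python (the exemplar is adapted from the paper's well-known tennis-ball example; everything else here is illustrative, not from either paper):

  # One worked exemplar whose answer shows the intermediate
  # reasoning, in the style of Wei et al. [1].
  COT_EXEMPLAR = (
      "Q: Roger has 5 tennis balls. He buys 2 more cans of "
      "tennis balls. Each can has 3 tennis balls. How many "
      "tennis balls does he have now?\n"
      "A: Roger started with 5 balls. 2 cans of 3 tennis balls "
      "each is 6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
  )

  def cot_prompt(question: str) -> str:
      # Prepend the exemplar so the model produces step-by-step
      # reasoning before its final answer on the new question.
      return COT_EXEMPLAR + "Q: " + question + "\nA:"

Fine-tuning on transcripts of interactions like this would plausibly bake the step-by-step style into the model itself, which is the effect being discussed here.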

[1] https://arxiv.org/abs/2201.11903

[2] https://arxiv.org/abs/2210.07128


