Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models

I found this statement from the paper to be at odds with what you cited, but I guess they mean SFT+RL would be better than either just SFT and RL



I think they're saying that some reasoning patterns which large models can learn using only RL (i.e. without the patterns existing in the training data), can't be learned by smaller models in the same way. They have to be 'taught' through examples provided during SFT.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: