
There are plenty of distills of reasoning models now, and they said in their livestream that they used training data from "smaller models", which probably covers every model ever released, given how expensive this one is.


Knowledge distillation is, by definition, teaching a smaller model from a bigger one, not the other way around.

Generating outputs from existing (and therefore smaller) models to train the largest model of all time would simply be called "using synthetic data". These are not the same thing at all.
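
To make the distinction concrete, here is a minimal sketch of classic (Hinton-style) knowledge distillation, where a small student is trained to match a bigger teacher's softened output distribution rather than just on generated text. This is a generic illustration, not any particular lab's pipeline; all names and shapes are placeholders.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Blend soft-label KL against the (bigger) teacher with hard-label CE."""
        # Soft targets: the teacher's distribution, smoothed by temperature T.
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        log_student = F.log_softmax(student_logits / T, dim=-1)
        kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
        # Hard targets: ordinary cross-entropy on the ground-truth labels.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

    # Toy usage: a batch of 4 examples over a 10-token vocabulary.
    student_logits = torch.randn(4, 10, requires_grad=True)
    teacher_logits = torch.randn(4, 10)          # frozen teacher outputs
    labels = torch.randint(0, 10, (4,))
    distillation_loss(student_logits, teacher_logits, labels).backward()

The direction of information flow is the point: the big model supplies the targets, the small model learns to imitate them. Feeding smaller models' outputs into a bigger model's training set is just data generation.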

Also, if you were to distill a reasoning model, the goal would be to get a (smaller) reasoning model, because you are teaching your new model to mimic outputs that show a reasoning/thinking trace. E.g., that's what all of those "local" DeepSeek models are: small Llama models distilled from the big R1, a process which "taught" Llama-8B to show reasoning steps before coming up with a final answer.
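
In practice that kind of reasoning distillation is mostly supervised fine-tuning on teacher-generated traces: collect (prompt, thinking trace, answer) text from the big model, then train the small student to reproduce it. A hypothetical sketch of the data-formatting step is below; the chat-template tokens and example are illustrative, not the actual DeepSeek recipe.

    from dataclasses import dataclass

    @dataclass
    class TraceExample:
        prompt: str
        reasoning: str   # the teacher's step-by-step "thinking" text
        answer: str

    def to_sft_text(ex: TraceExample) -> str:
        # The student is trained to reproduce the trace *and* the answer,
        # which is what teaches it to emit reasoning steps before answering.
        return (
            f"<|user|>{ex.prompt}\n"
            f"<|assistant|><think>{ex.reasoning}</think>\n{ex.answer}"
        )

    examples = [
        TraceExample(
            prompt="What is 17 * 24?",
            reasoning="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
            answer="408",
        )
    ]
    sft_corpus = [to_sft_text(ex) for ex in examples]
    # sft_corpus would then feed a standard causal-LM fine-tuning run on the student.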

