
Self-distillation and mutual distillation are used in MoE models. What you can do is freeze all but one expert and then train the model. If you want to do it again, you first have to do self/mutual distillation to spread that expert's training result onto the other experts; a sketch of the idea is below.
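A minimal sketch of what that could look like, assuming a toy PyTorch MoE layer. The model, expert count, and the MSE-based output-matching distillation loss are all illustrative assumptions, not a specific library's API or the commenter's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Toy mixture-of-experts layer: softmax gate over linear experts."""
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)                    # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n_experts, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)

model = ToyMoE()
x, y = torch.randn(32, 64), torch.randn(32, 64)   # dummy data

# Step 1: freeze every expert except expert 0, then train as usual.
for i, expert in enumerate(model.experts):
    for p in expert.parameters():
        p.requires_grad = (i == 0)

opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    F.mse_loss(model(x), y).backward()
    opt.step()

# Step 2: self/mutual distillation -- unfreeze the other experts and pull their
# outputs toward the trained expert's, spreading the update so another expert
# could be frozen-and-trained the same way afterwards.
for e in model.experts[1:]:
    for p in e.parameters():
        p.requires_grad = True

distill_opt = torch.optim.Adam(
    [p for e in model.experts[1:] for p in e.parameters()], lr=1e-3
)
for _ in range(100):
    distill_opt.zero_grad()
    with torch.no_grad():
        teacher = model.experts[0](x)        # trained expert acts as teacher
    sum(F.mse_loss(e(x), teacher) for e in model.experts[1:]).backward()
    distill_opt.step()
```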



