
There's nothing preventing you from sharing weights across layers, and it would be interesting to see some research on that.

E.g. the ALBERT model does that:

https://arxiv.org/abs/1909.11942
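
To make the idea concrete, here is a minimal PyTorch sketch of cross-layer parameter sharing in the ALBERT style: one set of layer weights applied at every depth. This is a toy illustration, not ALBERT's actual code, and the class name is made up; ALBERT also supports splitting layers into several groups, each with its own shared weights, which this sketch omits.

    import torch
    import torch.nn as nn

    class SharedLayerEncoder(nn.Module):
        """Toy encoder reusing one transformer layer across all depths
        (illustrative only; a single layer group for simplicity)."""
        def __init__(self, d_model=768, n_heads=12, n_layers=12):
            super().__init__()
            # A single set of weights, applied n_layers times.
            self.layer = nn.TransformerEncoderLayer(
                d_model, n_heads, batch_first=True)
            self.n_layers = n_layers

        def forward(self, x):
            for _ in range(self.n_layers):
                x = self.layer(x)  # same parameters at every depth
            return x

Parameter count then no longer grows with depth, which is why the model below can be so much smaller than its teacher.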

I have done model distillation of XLM-RoBERTa into ALBERT-based models with multiple layer groups, and for the tasks I was working on (syntax) it works really well.

E.g. we have gone from a fine-tuned ~1000 MiB XLM-R base model to a 74 MiB ALBERT-based model with barely any loss in accuracy.
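
For anyone unfamiliar with distillation: one common recipe (the commenter's exact setup isn't specified, so treat this as an assumption) is to train the small student on the teacher's softened output distribution via a temperature-scaled KL divergence, per Hinton et al. (2015):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """Soft-target distillation loss (one standard recipe,
        not necessarily the one used in the comment above)."""
        t = temperature
        soft_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_student = F.log_softmax(student_logits / t, dim=-1)
        # KL divergence scaled by t^2 to keep gradient magnitudes
        # comparable across temperatures
        return F.kl_div(log_student, soft_teacher,
                        reduction="batchmean") * (t * t)

In practice this is often mixed with the ordinary hard-label loss on the task data.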


