I have done model distillation of XLM-RoBERTa into ALBERT-based models with multiple layer groups, and for the tasks I was working on (syntax) it works really well.
E.g. we went from a fine-tuned ~1000 MiB XLM-R base model to a 74 MiB ALBERT-based model with barely any loss in accuracy.
The ALBERT paper describes this parameter sharing across layer groups:
https://arxiv.org/abs/1909.11942
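In case it's useful, here's a minimal sketch of what such a distillation setup can look like with Hugging Face transformers: a fine-tuned XLM-R teacher, a small ALBERT student whose `num_hidden_groups` gives multiple layer groups, and a soft-target KL loss on the logits. The label count, temperature, ALBERT dimensions, and the use of `xlm-roberta-base` as a stand-in for the actual fine-tuned checkpoint are all illustrative assumptions, not the exact configuration mentioned above.

```python
import torch
import torch.nn.functional as F
from transformers import (
    AlbertConfig,
    AlbertForTokenClassification,
    AutoModelForTokenClassification,
    AutoTokenizer,
)

NUM_LABELS = 17    # hypothetical label count for the syntax task
TEMPERATURE = 2.0  # hypothetical softmax temperature

# Teacher: in practice this would be the fine-tuned ~1000 MiB checkpoint;
# "xlm-roberta-base" here just stands in for it.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
teacher = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=NUM_LABELS
)
teacher.eval()

# Student: a small ALBERT where num_hidden_groups > 1 creates several
# layer groups (parameters are shared within a group rather than across
# all 12 layers). All sizes below are illustrative.
student = AlbertForTokenClassification(
    AlbertConfig(
        vocab_size=tokenizer.vocab_size,  # reuse the teacher's vocabulary
        num_hidden_layers=12,
        num_hidden_groups=4,
        hidden_size=312,
        num_attention_heads=12,
        intermediate_size=1248,
        num_labels=NUM_LABELS,
    )
)

def distillation_loss(batch: dict) -> torch.Tensor:
    """Soft-target KL divergence between teacher and student logits."""
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    return F.kl_div(
        F.log_softmax(s_logits / TEMPERATURE, dim=-1),
        F.softmax(t_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * TEMPERATURE ** 2

# Usage: tokenize a batch and take a gradient step on the distillation loss.
batch = tokenizer(["A short example sentence ."], return_tensors="pt")
loss = distillation_loss(batch)
loss.backward()
```

In a real run you would of course iterate over the training data with an optimizer, and possibly mix in the hard-label cross-entropy loss; the snippet only shows the core teacher-student wiring.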