
It's because BLOOM is undertrained: you can prune a lot of its weights without hurting performance. Look at the Chinchilla paper [1]; their 70B model outperforms the 175B GPT-3.

[1] https://arxiv.org/abs/2203.15556
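For a sense of scale, the commonly quoted rule of thumb from the Chinchilla paper is roughly 20 training tokens per parameter for compute-optimal training (Chinchilla-70B saw ~1.4T tokens, versus ~300B for GPT-3 175B). A minimal sketch of that rule of thumb, with the coefficient treated as an approximation rather than the paper's fitted scaling law:

    # Rough sketch of the Chinchilla "~20 tokens per parameter" rule of thumb.
    # The exact coefficients in the paper come from fitted scaling laws, so
    # treat this as illustrative, not authoritative.

    def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
        """Approximate compute-optimal token budget for a model with n_params parameters."""
        return n_params * tokens_per_param

    for name, n_params in [("Chinchilla-70B", 70e9), ("GPT-3-175B", 175e9), ("BLOOM-176B", 176e9)]:
        print(f"{name}: ~{chinchilla_optimal_tokens(n_params) / 1e12:.1f}T tokens to be compute-optimal")

By this estimate a 175B-class model would want on the order of 3.5T training tokens, far more than the few hundred billion that GPT-3 or BLOOM were actually trained on.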



In general, most giant LLMs are extremely undertrained at this point. Consider that most of the gains of RoBERTa over BERT came from just continuing to train.
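"Just continuing to train" is straightforward to do with off-the-shelf tooling. A minimal sketch using the Hugging Face Trainer for continued masked-LM pretraining; the dataset choice and hyperparameters here are placeholders, not the RoBERTa authors' recipe:

    # Minimal sketch of continued masked-LM pretraining with Hugging Face Transformers.
    # Dataset, hyperparameters, and checkpoint are placeholders, not the RoBERTa recipe.
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)
    from datasets import load_dataset

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")

    # Any large text corpus works; wikitext-103 is just an easy-to-load stand-in.
    raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=["text"],
    )

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    args = TrainingArguments(
        output_dir="roberta-continued",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=1e-5,
    )
    Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()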


Undertraining tends to show up as output that degenerates into gibberish or repeating loops. That happened a lot in the GPT-2 AI Dungeon days.
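A quick heuristic for spotting that kind of looping (my own sketch, not a standard metric) is to count how often the most frequent n-gram appears in a generated continuation:

    # Heuristic loop detector: how many times does the most common n-gram repeat?
    from collections import Counter

    def max_ngram_repetition(text: str, n: int = 4) -> int:
        tokens = text.split()
        if len(tokens) < n:
            return 0
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return max(Counter(ngrams).values())

    # A looping model might emit something like this:
    sample = "the cave is dark and the cave is dark and the cave is dark and"
    print(max_ngram_repetition(sample))  # 3 repeats of a 4-gram -> suspicious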


So could we continue training RoBERTa to get it to, say, GPT-3 Ada level?



