
Isn't LoRA training/merging on separate nodes basically "crank up the batch size ridiculously high"? What actually breaks when you do that?
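For a single step the analogy does hold: with a mean-reduced loss, averaging the gradients computed on N equal shards is exactly the gradient over the combined batch. A minimal PyTorch sketch with made-up tensors (not from the thread); the analogy weakens once each node takes many local steps before merging, which is where LoRA merging differs from genuine large-batch training.

    import torch

    torch.manual_seed(0)
    w = torch.randn(4, requires_grad=True)   # shared "model" weights
    x = torch.randn(16, 4)                   # 16 examples, split across 2 nodes
    y = torch.randn(16)

    def grad_of_mean_loss(xb, yb):
        # Mean-reduced squared error, gradient w.r.t. the shared weights.
        loss = ((xb @ w - yb) ** 2).mean()
        (g,) = torch.autograd.grad(loss, w)
        return g

    g_full = grad_of_mean_loss(x, y)                    # one big batch of 16
    g_avg = (grad_of_mean_loss(x[:8], y[:8])
             + grad_of_mean_loss(x[8:], y[8:])) / 2     # average of two shards

    print(torch.allclose(g_full, g_avg, atol=1e-6))     # True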


Cranking up the batch size kills convergence: you take far fewer optimizer steps per pass over the data and lose the gradient noise that helps generalization, so you have to retune the learning rate and usually still end up in a worse minimum.
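One standard partial mitigation is the linear learning-rate scaling rule with warmup. The constants below are made up for illustration; this narrows rather than closes the large-batch gap.

    def scaled_lr(base_lr, base_batch, batch, step, warmup_steps):
        # Scale the LR with batch size, ramping up linearly over warmup_steps.
        target = base_lr * batch / base_batch
        if step < warmup_steps:
            return target * (step + 1) / warmup_steps
        return target

    # LR tuned to 1e-4 at batch 32, now training at batch 1024:
    print(scaled_lr(1e-4, 32, 1024, step=0, warmup_steps=500))    # ~6.4e-6
    print(scaled_lr(1e-4, 32, 1024, step=499, warmup_steps=500))  # 3.2e-3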


I wonder if that can be avoided by modifying the training approach. Two ideas offhand: group the data by topic and train a different subset of weights per node, or figure out which layers diverge the most and reduce the learning rate on those only (rough sketch below).
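A rough sketch of the second idea, with a made-up top-k/shrink policy: measure how much each layer's weights diverge across node checkpoints, then shrink the learning rate on the worst offenders via per-parameter-group LRs.

    import torch

    def layer_divergence(checkpoints):
        # checkpoints: list of state_dicts from different nodes.
        # Returns per-tensor average std-dev across nodes.
        out = {}
        for name in checkpoints[0]:
            stacked = torch.stack([ckpt[name].float() for ckpt in checkpoints])
            out[name] = stacked.std(dim=0).mean().item()
        return out

    def build_param_groups(model, divergence, base_lr=1e-4, shrink=0.1, top_k=4):
        # Assumed policy: cut the LR by `shrink` on the top_k most divergent layers.
        worst = set(sorted(divergence, key=divergence.get, reverse=True)[:top_k])
        normal, divergent = [], []
        for name, p in model.named_parameters():
            (divergent if name in worst else normal).append(p)
        return [{"params": normal, "lr": base_lr},
                {"params": divergent, "lr": base_lr * shrink}]

    # optimizer = torch.optim.AdamW(build_param_groups(model, layer_divergence(ckpts)))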


A provable way to recover convergence is to use second-order information and compute the Hessian. The full Hessian is computationally expensive, but there are cheap approximation methods.
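For example, a Hutchinson-style estimate of the Hessian diagonal from Hessian-vector products avoids forming the full matrix. This is a generic sketch with an assumed loss and parameter list, not something specified in the thread.

    import torch

    def hessian_diag_estimate(loss, params, n_probes=8):
        # First-order grads with create_graph=True so we can differentiate again.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        diag = [torch.zeros_like(p) for p in params]
        for _ in range(n_probes):
            # Rademacher +/-1 probe vectors, one per parameter tensor.
            vs = [torch.randint(0, 2, p.shape, device=p.device).float() * 2 - 1
                  for p in params]
            # Hessian-vector product H v via a second backward pass.
            hv = torch.autograd.grad(grads, params, grad_outputs=vs,
                                     retain_graph=True)
            for d, h, v in zip(diag, hv, vs):
                d += h * v / n_probes   # E[v * (Hv)] = diag(H)
        return diag

    # Usage: diag = hessian_diag_estimate(loss, list(model.parameters()))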



