
Isn't LoRA training/merging on separate nodes basically "crank up the batch size ridiculously high"? What actually breaks when you do that?
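For a single step the analogy does hold: with a mean-reduced loss, averaging the gradients computed on N equal shards is exactly the gradient over the combined batch. A minimal PyTorch sketch with made-up tensors (not from the thread); the analogy weakens once each node takes many local steps before merging, which is where LoRA merging differs from genuine large-batch training.

    import torch

    torch.manual_seed(0)
    w = torch.randn(4, requires_grad=True)   # shared "model" weights
    x = torch.randn(16, 4)                   # 16 examples, split across 2 nodes
    y = torch.randn(16)

    def grad_of_mean_loss(xb, yb):
        # Mean-reduced squared error, gradient w.r.t. the shared weights.
        loss = ((xb @ w - yb) ** 2).mean()
        (g,) = torch.autograd.grad(loss, w)
        return g

    g_full = grad_of_mean_loss(x, y)                    # one big batch of 16
    g_avg = (grad_of_mean_loss(x[:8], y[:8])
             + grad_of_mean_loss(x[8:], y[8:])) / 2     # average of two shards

    print(torch.allclose(g_full, g_avg, atol=1e-6))     # True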


Cranking up the batch size kills convergence: you take far fewer optimizer steps per pass over the data and lose the gradient noise that helps generalization, so you have to retune the learning rate and usually still end up in a worse minimum.
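One standard partial mitigation is the linear learning-rate scaling rule with warmup. The constants below are made up for illustration; this narrows rather than closes the large-batch gap.

    def scaled_lr(base_lr, base_batch, batch, step, warmup_steps):
        # Scale the LR with batch size, ramping up linearly over warmup_steps.
        target = base_lr * batch / base_batch
        if step < warmup_steps:
            return target * (step + 1) / warmup_steps
        return target

    # LR tuned to 1e-4 at batch 32, now training at batch 1024:
    print(scaled_lr(1e-4, 32, 1024, step=0, warmup_steps=500))    # ~6.4e-6
    print(scaled_lr(1e-4, 32, 1024, step=499, warmup_steps=500))  # 3.2e-3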


I wonder if that can be avoided by modifying the training approach. Two ideas offhand: group the data by topic and train a different subset of weights per node, or figure out which layers diverge the most and reduce the learning rate on those only (rough sketch below).
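A rough sketch of the second idea, with a made-up top-k/shrink policy: measure how much each layer's weights diverge across node checkpoints, then shrink the learning rate on the worst offenders via per-parameter-group LRs.

    import torch

    def layer_divergence(checkpoints):
        # checkpoints: list of state_dicts from different nodes.
        # Returns per-tensor average std-dev across nodes.
        out = {}
        for name in checkpoints[0]:
            stacked = torch.stack([ckpt[name].float() for ckpt in checkpoints])
            out[name] = stacked.std(dim=0).mean().item()
        return out

    def build_param_groups(model, divergence, base_lr=1e-4, shrink=0.1, top_k=4):
        # Assumed policy: cut the LR by `shrink` on the top_k most divergent layers.
        worst = set(sorted(divergence, key=divergence.get, reverse=True)[:top_k])
        normal, divergent = [], []
        for name, p in model.named_parameters():
            (divergent if name in worst else normal).append(p)
        return [{"params": normal, "lr": base_lr},
                {"params": divergent, "lr": base_lr * shrink}]

    # optimizer = torch.optim.AdamW(build_param_groups(model, layer_divergence(ckpts)))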


A provable way to recover convergence is to use second-order information and compute the Hessian. The full Hessian is computationally expensive, but there are cheap approximation methods.
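For example, a Hutchinson-style estimate of the Hessian diagonal from Hessian-vector products avoids forming the full matrix. This is a generic sketch with an assumed loss and parameter list, not something specified in the thread.

    import torch

    def hessian_diag_estimate(loss, params, n_probes=8):
        # First-order grads with create_graph=True so we can differentiate again.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        diag = [torch.zeros_like(p) for p in params]
        for _ in range(n_probes):
            # Rademacher +/-1 probe vectors, one per parameter tensor.
            vs = [torch.randint(0, 2, p.shape, device=p.device).float() * 2 - 1
                  for p in params]
            # Hessian-vector product H v via a second backward pass.
            hv = torch.autograd.grad(grads, params, grad_outputs=vs,
                                     retain_graph=True)
            for d, h, v in zip(diag, hv, vs):
                d += h * v / n_probes   # E[v * (Hv)] = diag(H)
        return diag

    # Usage: diag = hessian_diag_estimate(loss, list(model.parameters()))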



