This is a pretty hyped-up optimizer that seems to have okay-ish performance in-p...

aarondefazio · on April 7, 2024

The behavior is actually more complex than a 1/t schedule. It behaves like a linear decay schedule 1-t/T with fixed stopping time T, as if T had been chosen in advance as the current timestep. When warmup is included, this is similar to high performance triangular learning rate schedules. Schedules of the form 1/t schedules perform really poorly in practice, we actually did a large scale comparison that included them in a prior paper: https://arxiv.org/pdf/2310.07831.pdf

danielhanchen · on April 7, 2024

My main current concerns are I tried asking for a transformer benchmark to see if this worked on transformers, but didn't get any response. Also they seem particularly focused on CNN type benchmarks, but did not bother to benchmark superconvergence + Ranger21 + the learning rate range finder, since they explicitiy said Schedule-Free needs tuning as well.

Their past research on D-Adpatation (won ICML best paper 2023) and their follow up work Prodigy all did worse / similar than AdamW, so maybe this works on CNNs, but does not on transformers - but for CNNs we have superconvergence.

I shall wait for their paper which will come in 1-2 months.