Essentially, instead of training on tokens that are "already there" in the text, distillation lets us simulate training data from a larger model: the teacher's predicted distribution over next tokens serves as a soft target for the student.
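The idea can be sketched as follows. This is a minimal illustration, not any particular library's API: the function names and the tiny 4-token vocabulary are hypothetical, and a real setup would use a framework's KL-divergence loss over batches of teacher logits. The key point is that the student's loss is computed against the teacher's full soft distribution, so every vocabulary entry contributes signal, not just the one token that actually occurred.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Scale logits by a temperature, then normalize into a probability
    # distribution; subtracting the max keeps exp() numerically stable.
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy of the student against the teacher's *soft* targets:
    # unlike one-hot next-token training, every token in the vocabulary
    # carries some probability mass from the teacher.
    p_teacher = softmax(teacher_logits, temperature)
    log_q_student = np.log(softmax(student_logits, temperature))
    return -np.sum(p_teacher * log_q_student)

# Hypothetical next-token logits over a tiny 4-token vocabulary.
teacher = np.array([4.0, 2.0, 1.0, 0.5])
aligned_student = np.array([4.1, 2.1, 0.9, 0.4])
misaligned_student = np.array([0.5, 1.0, 2.0, 4.0])

# The loss is lower when the student's distribution matches the teacher's.
assert distillation_loss(teacher, aligned_student) < \
       distillation_loss(teacher, misaligned_student)
```

The temperature softens both distributions so that the teacher's relative preferences among unlikely tokens (its "dark knowledge") still reach the student.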