The whole value of this model is in its 680,000 hours of training data, and to reuse that value you need a large model, not smaller ones. Smaller versions just don't have enough capacity to represent the training data properly.
I get that. I'm saying the medium.en model specifically seems to have some weird edges to its behavior that aren't present in the models up or down the scale from it, or in its same-size counterpart (the plain 'medium' model).
It's the only one that occasionally seems to spit out significant chunks of training data rather than something that resembles the audio.