That could just be the context, though. Listening to a clip that's a demo of what the model can produce is very different to listening to a YouTube video that uses the model to narrate something you'd actually want to watch.