onion2k 6 months ago | on: Pushing the frontiers of audio generation

That could just be the context though. Listening to a clip that's a demo of what the model can produce is very different to listening to a YouTube video that's using the model to generate speech about something you'd actually want to watch a video of.