Yes, although I believe this is a speaker embedding model, so it's not LLM-related.

This kind of speech clustering has been possible for years. The exciting part of their model is how highly focused it is on accents alone. Here's a video of mine from 2020 that demonstrated this kind of voice clustering in the Mozilla TTS repo (sadly the code broke and was dropped after a refactoring). Bokeh made it possible to click directly on points in a cluster and have them play.

https://youtu.be/KW3oO7JVa7Q?si=1w-4pU5488WxYL3l

Note: take care when listening, as the audio level varies a bit (sorry!)