Yes, although I believe this is a speaker embedding model here, so not LLM related.
This kind of speech clustering has been possible for years - the exciting point with their model here is how it's highly focused on accents alone. Here's a video of mine from 2020 that demonstrated this kind of voice clustering in the Mozilla TTS repo (sadly the code got broken + dropped after a refactoring). Bokeh made it possible to directly click on points in a cluster and have them play
This kind of speech clustering has been possible for years - the exciting point with their model here is how it's highly focused on accents alone. Here's a video of mine from 2020 that demonstrated this kind of voice clustering in the Mozilla TTS repo (sadly the code got broken + dropped after a refactoring). Bokeh made it possible to directly click on points in a cluster and have them play
https://youtu.be/KW3oO7JVa7Q?si=1w-4pU5488WxYL3l
note: take care when listening as the audio level varies a bit (sorry!)