Interesting that SoundStorm was trained to produce dialog between two people using transcripts annotated with '|' marking changes in voice. But the exact same '|' characters seem to mostly work in the Bark model out of the box and also produce a dialog?
Maybe a third or a bit more of Bark outputs come out as a "dialog" of one person talking to themselves -- it often misses a voice change. But the pipe characters do reliably produce audio that sounds like a dialog in performance style.
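For anyone who wants to try the pipe trick, here's a minimal sketch. The `make_dialog_prompt` helper is hypothetical (just string joining); the commented-out `generate_audio` call is the actual API from suno-ai/bark and needs the package plus model weights to run.

```python
def make_dialog_prompt(lines):
    """Hypothetical helper: join alternating speaker utterances with ' | ',
    the same separator SoundStorm's transcripts used for voice changes."""
    return " | ".join(line.strip() for line in lines)

prompt = make_dialog_prompt([
    "Hey, did you hear about the new model?",
    "[laughs] Yeah, the pipe trick actually works.",
])

# To actually synthesize (requires suno-ai/bark and its model weights):
# from bark import generate_audio
# audio_array = generate_audio(prompt)
```

In my experience you still get the one-person-monologue failure mode some of the time, so it's worth generating a few takes per prompt.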
Is there some text-audio data somewhere in the training data that uses | for voice changes?
Amusingly, Bark tends to render the SoundStorm prompts sarcastically. Not sure if that's a difference in style in the models, or just Google cherry picking the more straightforward line readings as the featured samples.
The creators won't say, as far as I know, but Bark looks to be trained on a lot of YouTube corpora (rather than typical ML audio datasets) where audio may have transcripts like that, which would also explain why stuff like [laughs] works.
In the future, will children think it's normal to talk like, "Hey, what up, Youtube! ... Be sure to like and subscribe! ... Smash that like button! ... Let me know in the comments down below!"?
I wonder how ML trained on the tone transitions to a sponsored segment dripping with secret shame... would infect general speech.
Yeah, I often try to think about what might be in a YouTube caption when finding prompts that work in Bark. But the pipe character isn't one I remember seeing on YouTube. Maybe it's part of some other audio dataset, though. Or maybe it's on YouTube but only in non-English videos.
https://twitter.com/jonathanfly/status/1675987073893904386