This is just a single generation pass, though, so the voice is a bit unstable and maybe not perfectly consistent with the direct speech. Ideally you would tweak it a bit: break it into chunks, isolate the direct speech, then cut it all together, etc.
Still, it's pretty good, I'd say. As I mentioned previously, being able to give specific directions is around the corner, along with "proper" professional training of the voice model on a specific voice.