Did you even watch the video?
It's just baffling how I have to spell this out.
Skip to 11:50 or watch the very first demo with the breathing. None of that is possible with TTS and STT. You can't ask old voice mode to slow down or modulate tone or anything like that because it's just working with text.
Yes, I watched the demo. True, those things were not possible, so if that’s what’s blowing you away then fair enough, I guess. For me, that doesn’t affect anything I have ever used voice for or probably ever will.
I’ve voice chatted with ChatGPT for hundreds of hours and never once thought “can you modulate your tone please?”, so those improvements are a far cry from magic or revolutionary imho. Again, that’s not to say they aren’t cool tech, forward advancements, or impressive, but magic or revolutionary are pretty high bars.
Few people are going to say "modulate your tone" in a vacuum, sure, but that doesn't mean that ability, along with being able to manipulate all other aspects of speech, isn't an incredible advance that is going to be very useful.
Language learning, audiobook narration that is far more involved, you could probably generate an audio drama, actual voice acting, even just not needing to get all my words in before it prompts the model with the transcribed text, conversation that doesn't feel like someone is reading a script.
And no, thumbing the pause button, sending an image and going back does not even begin to compare in usability.
Great leaps in usability are a revolution in themselves. GPT-3 existed for years, so why did ChatGPT explode when it did? You think it was intelligence? No. It was the usability of the chat interface.