Can you tell the current voice model what feelings and tone it should communicate with? If not, it isn't even comparable. Being able to control how it reads things is absolutely revolutionary; that is what was missing from using these AI models as voice actors.
+1. Check the demo video in OP titled "Sarcasm". The human asks GPT-4o to speak "dripping in sarcasm", and the tone that comes back is spot on. Compared with the current voice model, it's a total sea change.