This is amazing, but one thing I don't really understand: earlier in the presentation, they demoed some new Google Assistant voices, and all of them sound like standard computer-synthesized assistants. The synthesized Duplex voices, on the other hand, sound indistinguishable from human speech to me, even leaving aside the "disfluencies" they include.
If Google has gotten speech synthesis to this point, why isn't Assistant synthesizing speech of this quality?
I have a feeling it's because this is such a limited domain.
During the demo it all sounded very realistic, except for a few parts like the times. It would flow naturally, then all of a sudden pause awkwardly and say a time like "12 pm" in an unnatural way.
I have a feeling they've gotten it to sound so realistic because there's a fairly small set of responses and questions it needs to handle, so they can either pre-record real humans or heavily tune an ML voice to sound as natural as possible.
I'm reminded of something I read about generating "yellow letters" for real estate leads. The letters look like someone handwrote them on a steno pad. If you used a handwriting font throughout, they would look fake, but you can't handwrite each one either; that would take too long for thousands of letters. So they get a handwriting font made from their own handwriting, which on its own still looks fake, write most of the letter template out by hand, and use the font only for the parts that change. I wonder if something similar is going on here: they've heavily trained the speech synthesis on a set of stock phrases and fill in the variable bits with synthesized speech where needed. The inflection is also easier to specify in those cases.
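Just to illustrate what that "fill in the variable bits" approach might look like mechanically, here's a rough Python sketch: fixed phrases come from pre-recorded (or heavily tuned) audio, and only the variable slot, the time in this case, is synthesized and spliced in. The synthesize() placeholder, the file layout, and the example phrases are all my own guesses for illustration, not anything Google has said about how Duplex works.

    # Rough sketch of the "handwriting font" idea applied to speech.
    # Fixed spans would be pre-recorded or hand-tuned audio; only the
    # variable slot (the time) gets synthesized and spliced in.
    import wave

    def synthesize(text: str) -> str:
        """Placeholder TTS: writes one second of silence where a real
        engine would render `text`, and returns the file path."""
        path = f"synth_{abs(hash(text))}.wav"
        with wave.open(path, "wb") as out:
            out.setnchannels(1)        # mono
            out.setsampwidth(2)        # 16-bit samples
            out.setframerate(16000)    # 16 kHz
            out.writeframes(b"\x00\x00" * 16000)
        return path

    def stitch(paths: list[str], out_path: str) -> None:
        """Concatenate WAV clips (same format) into one utterance."""
        with wave.open(out_path, "wb") as out:
            for i, path in enumerate(paths):
                with wave.open(path, "rb") as clip:
                    if i == 0:
                        out.setparams(clip.getparams())
                    out.writeframes(clip.readframes(clip.getnframes()))

    # Template: two fixed spans around one machine-filled slot. In a
    # real system the fixed spans would be human recordings; here
    # everything goes through the placeholder so the script runs.
    stitch(
        [
            synthesize("do you have anything at"),  # fixed span
            synthesize("12 pm"),                    # variable slot
            synthesize("for four people"),          # fixed span
        ],
        "utterance.wav",
    )

If that's roughly what's happening, it would also explain the awkward pauses around the times in the demo: those are exactly the seams where a slot gets spliced into the fixed audio.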