This is amazing, but one thing I don't really understand: earlier in the presentation, they demoed some new Google Assistant voices, and all of them sound like standard computer-synthesized assistants. The synthesized Duplex voices, on the other hand, sound indistinguishable from human speech to me, even leaving aside the "disfluencies" they include.
If Google has gotten speech synthesis to this point, why isn't Assistant synthesizing speech of this quality?
I have a feeling it's because this is such a limited domain.
During the demo it all sounded very realistic, except for a few parts like the times. It would flow naturally, then all of a sudden pause awkwardly and say a time like "12 pm" in an unnatural way.
I have a feeling they've gotten it to sound so realistic because there's a fairly small set of responses and questions it needs to handle, so they can either pre-record real humans or heavily tune an ML voice to sound as natural as possible.
I'm reminded of something I read about generating "yellow letters" for real estate leads. The letters look like someone handwrote them on a steno pad. If you used a handwriting font throughout, they would look fake, but you can't handwrite each one either; that would take too long for thousands of letters. So they get a handwriting font made from their own handwriting, which on its own still looks fake, write most of the letter template out by hand, and use the font only for the parts that change. I wonder if something similar is going on here: they've heavily trained the speech synthesis on a set of stock phrases and fill in the variable bits with synthesized speech where needed. The inflection is also easier to specify in those cases.
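Just to illustrate what that "fill in the variable bits" approach might look like mechanically, here's a rough Python sketch: fixed phrases come from pre-recorded (or heavily tuned) audio, and only the variable slot, the time in this case, is synthesized and spliced in. The synthesize() placeholder, the file layout, and the example phrases are all my own guesses for illustration, not anything Google has said about how Duplex works.

    # Rough sketch of the "handwriting font" idea applied to speech.
    # Fixed spans would be pre-recorded or hand-tuned audio; only the
    # variable slot (the time) gets synthesized and spliced in.
    import wave

    def synthesize(text: str) -> str:
        """Placeholder TTS: writes one second of silence where a real
        engine would render `text`, and returns the file path."""
        path = f"synth_{abs(hash(text))}.wav"
        with wave.open(path, "wb") as out:
            out.setnchannels(1)        # mono
            out.setsampwidth(2)        # 16-bit samples
            out.setframerate(16000)    # 16 kHz
            out.writeframes(b"\x00\x00" * 16000)
        return path

    def stitch(paths: list[str], out_path: str) -> None:
        """Concatenate WAV clips (same format) into one utterance."""
        with wave.open(out_path, "wb") as out:
            for i, path in enumerate(paths):
                with wave.open(path, "rb") as clip:
                    if i == 0:
                        out.setparams(clip.getparams())
                    out.writeframes(clip.readframes(clip.getnframes()))

    # Template: two fixed spans around one machine-filled slot. In a
    # real system the fixed spans would be human recordings; here
    # everything goes through the placeholder so the script runs.
    stitch(
        [
            synthesize("do you have anything at"),  # fixed span
            synthesize("12 pm"),                    # variable slot
            synthesize("for four people"),          # fixed span
        ],
        "utterance.wav",
    )

If that's roughly what's happening, it would also explain the awkward pauses around the times in the demo: those are exactly the seams where a slot gets spliced into the fixed audio.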