
This is amazing, but one thing I don't really understand is this: earlier in the presentation, they demoed some new Google Assistant voices. All of them sound like standard computer-synthesized assistants. On the other hand, the synthesized Duplex voices sound indistinguishable from human speech to me, even without the "disfluencies" they include.

If Google has gotten speech synthesis to this point, why isn't Assistant synthesizing speech of this quality?



I have a feeling it's because this is such a limited domain.

During the demo it all sounded very realistic, except for some parts like the times. It would flow naturally, then all of a sudden pause awkwardly and say a time like "12 pm" in an odd way.

I have a feeling they are getting it to sound so realistic because there's a fairly small set of responses and questions it needs to work with, so they can either pre-record real humans or heavily tune an ML voice to sound as natural as possible.
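As a rough illustration of what a limited-domain system like this might look like, here's a minimal sketch: a small library of canned response templates keyed by intent, where only the variable slots (names, times) need fresh synthesis each call. All names here are hypothetical, not Google's actual design.

```python
# Sketch of a limited-domain response generator. The fixed template
# text could map to carefully tuned (or even pre-recorded) audio,
# while only the slot values need on-the-fly synthesis -- which might
# explain the awkward prosody around times in the demo.

RESPONSE_TEMPLATES = {
    "greet":        "Hi, I'm calling to book an appointment for {name}.",
    "propose_time": "Would {time} work?",
    "confirm":      "Great, {time} it is. Thank you!",
}

def render(intent, **slots):
    """Pick the canned template for this intent and fill in its slots."""
    return RESPONSE_TEMPLATES[intent].format(**slots)
```

With a corpus this small, the hard synthesis work can be front-loaded per template rather than done per utterance.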


Hmm, or maybe it's because you can run the synthesis on a much more powerful computer? Certainly very impressive.


It's possible all the questions are even prerecorded human voices. The corpus is probably fewer than 1,000 phrases.


This is simply because Google Duplex's TTS engine is using Tacotron and WaveNet, which are not ready for general use yet.


WaveNet is already used in the Assistant and Google Translate at least. The new voices they announced are powered by WaveNet.


I think all WaveNet speech is still generated server side instead of on the client hardware, so there is a cost associated with it.

If Google Duplex is a paid product, maybe it just enables running WaveNet on Google Cloud with larger models and higher-quality settings.

Speech produced for Assistant doesn't make any money, so the server-side cost has to be minimised.

One day we'll have all this client side, on specialised ML chips on devices.


I'm reminded of something I read about generating "yellow letters" for real estate leads. The letters look like someone handwrote them on a steno pad. A plain handwriting font would look fake, but you can't handwrite each one; that would take too long for thousands of letters. So they get a handwriting font made from their own handwriting, which by itself still looks fake, but they write most of the letter template out by hand and use the font only for the parts that change. I wonder if something similar is going on here: they have taught the speech synthesis a bunch of fixed phrases and fill in only the variable bits with synthetic speech where it's needed. The inflection is also easier to specify in those cases.
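The yellow-letter idea above boils down to splitting a template into fixed spans (produced once, with care) and variable slots (produced fresh each time). A minimal sketch of that split, assuming a hypothetical `{slot}` syntax:

```python
import re

# Split a template into fixed spans vs. variable slots. The fixed
# spans are the parts you could tune or record once (the "handwritten"
# portion of the letter); the slots are the parts that must be
# generated per instance (the handwriting-font portion).

def split_template(template):
    """Return a list of (kind, text) pairs: 'fixed' spans or 'slot' names."""
    parts = []
    for m in re.finditer(r"\{(\w+)\}|([^{]+)", template):
        if m.group(1):
            parts.append(("slot", m.group(1)))
        else:
            parts.append(("fixed", m.group(2)))
    return parts
```

Knowing exactly which spans are fixed is what would let a system lavish effort (or human recordings) on them while synthesizing only the slots.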



