I came here to say this. My brother has a PhD in chemistry and no coding experie...

audiohermit · on June 11, 2020

I work in pathological speech processing/synthesis so I'm unfortunately familiar with your father's position. It really sucks that these people didn't know that archiving their voice would've been useful. I hear snippets that people manage to glean from family videos right after listening to their current voices and it makes me really sad.

On the upside, your father can choose any celebrity he wants to voice him! Tons of celeb data is publicly available (VoxCeleb 1 & 2).

clan · on June 11, 2020

Are there any simple howtos anywhere which describes the process in as simple terms as possible? Without knowing the cool toolkits du jour.

Something like: - Download these texts - Record in WAV at least 48 kHz - Record each line in a separate file. - Do 3 takes of each line: flat, happy, despair

Maybe even a minimal set and a full set depending on how much effort you are willing to put in.

A plain description on how to capture a raw base which within reason and technology could be used as a baseline for the most common toolkits.

I have myself looked into this (for fun) but I felt I needed a very good understanding of the toolkits before even starting to feed in data. And for my admittedly unimportant use it seemed a huge investment to create a corpus I was not even confident would work. I ended up taking the low road and used an existing voice.

audiohermit · on June 11, 2020

Not really, this is the only thing I know of in terms of collection: https://www.isca-speech.org/archive/Interspeech_2018/pdfs/24... Usually you're basing your recipe off of those for existing datasets (TIMIT, WSJ, LibriSpeech, etc).

voicevoice50 · on June 12, 2020

For recording training audio:

https://github.com/daanzu/speech-training-recorder

The recorder works with Python 3.6.10. Need to pip install webrtcvad also.

vervez · on June 11, 2020

Is Morgan Freeman the most used celebrity?

Deinos · on June 11, 2020

A ranking of used voices would be fascinating. Especially broken down by user statistics.

core-questions · on June 11, 2020

I'd go for Stephen Hawking, myself.

(Not using his voice synth, reconstructed using ML, because it should sound more natural that way ;-)

shagie · on June 11, 2020

I recall that the "say" program on the SGI from the mid 90's was approximately Hawking's voice. Hawking gave his speech for the Whitehouse Millennium Lecture at SGI also, and while I wasn't able to attend I found the transcript of it and fed it in there... there were some jokes that he had that only really came through with the intonation and pacing of a voice synth -- its the ultimate dead pan voice.

https://clinton.presidentiallibraries.us/items/show/16112 https://youtu.be/orPUQm1ZRSI

And his voice was his - even with the American accent.

https://www.news.com.au/technology/innovation/why-stephen-ha...

> “It is the best I have heard, although it gives me an accent that has been described variously as Scandinavian, American or Scottish.”

> ...

> “It has become my trademark and I wouldn’t change it for a more natural voice with a British accent.

> “I am told that children who need a computer voice want one like mine.”

Somewhere, I recall a NOVA(?) program from the mid 80s where it showed him using the speech synthesizer and the thing that he said with it that still sticks in my mind is the "please excuse my American accent". In later years he was given the opportunity to upgrade it to a more natural sounding voice - but that voice was his.

egypturnash · on June 12, 2020

Near the end of his life, his original voice computer started to fall apart. He managed to get in touch with the people who wrote the software, who started a mad scramble to find source, and ultimately ended up emulating the whole setup on a Pi.

https://theweek.com/articles/769768/saving-stephen-hawkings-...

hapless · on June 12, 2020

Hawking’s original voice synth was the default sound of the DECtalk hardware speech synth.

It would not surprise me if SGI’s software implementation were similar to the he most popular hardware of the 1980s.

grogenaut · on June 12, 2020

Unfortunately dad passed 21 years ago. But the options now are much better. Just projecting my past experiences on the obvious Delta.

dheera · on June 11, 2020

Which generator works the best, qualitatively? I come from a vision/ML background but haven't played with speech at all, so it's completely new to me, and wondering what the state of the art is.

I've been wanting to create a TTS of myself so I can take phone calls using headphones and type back what I want to say so that I don't have to yell private information out loud in public locations. Would be nice if during non-COVID times I could sit in a train seat and take phone calls completely silently.

audiohermit · on June 11, 2020

Much of the work in speech synthesis has been about closing the gap in vocoders, which take a generated spectrogram and output a waveform. There's a clear gap between practical online implementations and computational behemoths like WaveNet. As you implied it's hard to quantitatively judge which result is better, papers usually use surveys to judge.

Here's a recent work that has a good comparison of some vocoders: https://wavenode-example.github.io/

Edit: WaveRNN struck a good balance for me in the past but is not shown in the link. Tons of new work coming out though!

sdenton4 · on June 11, 2020

WaveRNN (and even slimmer versions, like LPCNet) are great, and run for a tiny fraction of the compute of the original WaveNet. Pruning is also a good way to reduce model sizes.

I'm not sure what's up with the WaveGLOW (17.1M) example in the linked wavenode comparison... The base WaveGLOW sounds reasonable, though. They're also using all female voices, which strikes me as dodgy; lower male voice pitch tracking is often harder to get right, and a bunch of comparisons without getting into harder cases or failure modes makes it seem like they're covering something up.

(I've run into a bunch of comparisons for papers in the past where they clearly just did a bad job of implementing the prior art. There should be a special circle of hell...)

audiohermit · on June 11, 2020

Agreed. I didn't have a better comparison at hand.

I'm looking at you GAN papers.

tombert · on June 11, 2020

This sounds pretty cool (your brother making the voice model, not your dad losing the voice)...do you have a link to this example? I would love to play with this.