Project Common Voice (voice.mozilla.org)
205 points by mhr_online on July 18, 2017 | 57 comments


The terminology is a bit confusing. They say they want to build voice recognition, but it seems they actually want to build a speech recognition engine. Speech recognition is about recognizing the speech, the spoken words. Voice recognition is about recognizing the speaker's voice, i.e. identifying the speaker. Maybe they also want to build a text-to-speech (TTS) system, but I'm not sure.

In any case, the collected data might be useful for all of that, except perhaps for voice recognition, because I guess the data will be collected anonymously?

Note that there are other existing big open speech corpora, such as LibriSpeech (http://www.openslr.org/12/), which could already be used to build a quite good speech recognition system.


I can tell from your comment (and its responses) that the language on our homepage is a bit confusing, so thank you for the feedback.

To answer your question: Common Voice is about building a collection of labelled voice data (i.e. sentence clips with transcripts) that can be used to, for instance, train speech-to-text algorithms. Part of the goal of this project, though, is to figure out how this data can best help people build voice technology, so it's pretty open ended at this point.

Mozilla does have an open source speech-to-text engine [1] we are developing, and we hope one day to use the Common Voice data to train this engine. DeepSpeech and Common Voice are related, but separate projects, if that makes sense.

As for LibriSpeech, the DeepSpeech team at Mozilla does use this data for training. However, the language is pretty antiquated, and we only get about 1K hours of data, whereas you need about 10K hours to get to decent accuracy (a WER of 10% or below). Common Voice is about adding to public corpora like LibriSpeech, not replacing them.

1.) https://github.com/mozilla/DeepSpeech
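To make "labelled voice data" a bit more concrete: the exact release format hasn't been announced yet, so the file layout below (a hypothetical clips.tsv of WAV paths plus transcripts) is just an assumption, but a minimal Python sketch of loading such a corpus and tallying how many hours it contains could look like this:

    import csv
    import wave

    # Hypothetical metadata file: one "path<TAB>transcript" row per clip.
    # (Illustrative only; not the actual Common Voice release format.)
    total_seconds = 0.0
    with open("clips.tsv", newline="") as f:
        for path, transcript in csv.reader(f, delimiter="\t"):
            with wave.open(path, "rb") as clip:
                total_seconds += clip.getnframes() / clip.getframerate()
            # `transcript` would be the label used to train speech-to-text.

    print("corpus size: %.1f hours" % (total_seconds / 3600))  # target ~10K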


Very interesting. I was not aware that Mozilla DeepSpeech exists (it implements, in TensorFlow, the model from Baidu's paper of the same name). Note that the issue with DeepSpeech (the CTC model from the Baidu paper) is that it really needs a lot of training data to perform well (that is a generic property of CTC). If you use more conventional models (hybrid NN/HMM models), you can get very decent word-error-rate performance with only a few hundred hours of data. The advantage of DeepSpeech, of course, is that it is simpler and you don't need a lexicon (a mapping from words to their pronunciations, i.e. sequences of phonemes).
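To make the contrast concrete, here is a minimal sketch (not DeepSpeech's actual code; it uses a recent tf.nn.ctc_loss signature, and the lexicon entries and tensor shapes are made up purely for illustration):

    import tensorflow as tf

    # A hybrid NN/HMM system needs a pronunciation lexicon, roughly like this
    # (made-up ARPAbet-style entries, purely illustrative):
    lexicon = {
        "hello": ["HH", "AH", "L", "OW"],
        "world": ["W", "ER", "L", "D"],
    }

    # A CTC model like DeepSpeech skips the lexicon and maps acoustic frames
    # directly to characters, which is part of why it needs so much data.
    time_steps, batch, num_classes = 50, 2, 29  # 26 letters + space + ' + blank
    logits = tf.random.normal([time_steps, batch, num_classes])  # time-major
    labels = tf.constant([[8, 5, 12, 12, 15],                    # "hello"
                          [23, 15, 18, 12, 4]], dtype=tf.int32)  # "world"
    loss = tf.nn.ctc_loss(labels, logits,
                          label_length=tf.constant([5, 5]),
                          logit_length=tf.constant([time_steps, time_steps]),
                          logits_time_major=True,
                          blank_index=num_classes - 1)
    print(loss.numpy())  # one CTC loss value per utterance in the batch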

I would also not use voice technology as the generic term for speech recognition, text-to-speech, and whatever else you want to do with this data. Rather, speech technology is the common term to cover all of this (https://en.wikipedia.org/wiki/Speech_technology).


Noted. Again thanks for the feedback :)


Most people see "voice recognition" and think "speech recognition". I am guilty of this, having never looked further into the term. I probably saw "speech recognition" and "voice recognition" and assumed they meant the same thing. At the very least I now know each term has a specific meaning, but I have a feeling most people don't. Maybe whoever wrote the page isn't an actual developer but just a volunteer writer?


Could they use the different voices to generate unique, natural-sounding voices for text-to-speech?


I really hope so.

All the text-to-speech software I've used has a generic-sounding accent for the country (your choices are typically American, Australian, British, or Canadian), but there are a lot more accents out there. The software isn't bad - it sounds realistic - but I wish it sounded more like how I would like it to.

There is some software, e.g. Cepstral Dallas (https://www.cepstral.com/en/demos), but it sounds too robotic to actually use, and that voice isn't available for Linux, so I only have it installed on my MacBook.

I guess a lot of developers at e.g. Apple live in CA so Siri is probably influenced by that.


> Speech recognition is about recognizing the speech, the spoken words. Voice recognition is about recognizing the speaker's voice, i.e. identifying the speaker.

Perhaps they want to do both eventually (?) That could explain the name.


Common Voice is only about collecting a large public database of voices. We do have a separate project around speech-to-text [1]. We haven't done much work around speaker recognition (AFAIK) or voice synthesis, but they are both very interesting from both a technical and a privacy standpoint. That said, both are out of scope for Common Voice (which is only about the data).

1.) https://github.com/mozilla/DeepSpeech


This looks great! I use voice control to program on occasion due to RSI. The standard stack for this is a mess because of closed-source systems that aren't designed for voice programmers. A good open solution could really save me a lot of headaches.


Previously, there was VoxForge [0], but it seems dead. At least, I failed to contribute my voice there. Mozilla getting into this space is good news indeed.

[0] http://www.voxforge.org/home/read


You can take a look at what we're building at https://snips.ai; we will open-source the platform later this year.


Cool. Just curious: what is the voice engine behind Snips, and who provides the training data? Also, do you have plans to support additional languages, or can it be trained on them once you open-source it?


If they're planning to make a voice recognition system, why are they using example statements that are clearly taken from novels? [0] That's not how real people talk. They use a lot more slang, a lot more stopping and starting, filler words, etc. Instead you have people saying things like "irresolute", "rumbling", and other complex words. It would be useful for training a novel dictation system, but it's not how people would speak to their browser for example.

[0]: An example sentence is "a thin circle of bright metal showed between the top and the bottom of the body of the cylinder", which is from H. G. Wells' War of the Worlds.


Maybe there aren't yet good open datasets available for this kind of material?

This gives Amazon, Apple and Google a nice advantage since they are able to collect huge sample sets of actual voice commands used by people and to some extent also correlate them with the actual action taken by the person.

How could we collect such a dataset? It's a bit of a chicken-and-egg problem: I don't want to talk to some open source system unless it has a fairly good chance of understanding me. Should we try to semi-manually (through crowdsourcing) come up with potential requests like "Check news from CNN.com" or "Order me a quattro stagioni", which could then be fed to a platform like Common Voice?

Or should we work at a higher level? Come up with task descriptions ("You want to order a taxi to get to the airport for your morning flight at 7 am") and then let people record how they would actually make that request of a computer by voice. This might more accurately capture the language we actually use when speaking. Through some simple automation you could generate variations of the requests, and at least part of the same base material could be used for different languages (task given in English, ask the person to make the request in Finnish).
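A rough sketch of that "simple automation", with made-up templates and slot values (nothing here comes from Common Voice itself):

    from itertools import product

    # Made-up task: "order a taxi to the airport for a 7 am flight".
    # Each template is one plausible way a person might phrase the request.
    templates = [
        "Order {vehicle} to the airport for {time}",
        "I need {vehicle} to the airport, my flight is at {time}",
        "Get me {vehicle} to the airport before {time}",
    ]
    slots = {
        "vehicle": ["a taxi", "a cab", "an Uber"],
        "time": ["7 am", "seven in the morning", "07:00"],
    }

    # Expand every slot combination into a candidate prompt that volunteers
    # could then read aloud (or paraphrase) on a platform like Common Voice.
    for template in templates:
        for vehicle, time in product(slots["vehicle"], slots["time"]):
            print(template.format(vehicle=vehicle, time=time))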


If you want people carefully reading books, it is pretty easy to get hold of that kind of data in the form of audiobooks and the work of Recording for the Blind and Dyslexic. Sure, it isn't chunked into sentences, but since you have all of the source text you could do a quite reasonable job of automating the slicing, throw out places where you aren't sure, and still have a near-infinite amount of great data. (Note that these sentences aren't perfect anyway, hence the filtering process with volunteers: while I was judging some audio files, one of the issues was "person turned off microphone a little too soon".)
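For anyone who wants to try that, here is a rough sketch of the slicing step (assuming pydub is installed and you have an audiobook file on disk; the text-to-audio alignment and the "throw out places you aren't sure" filtering are only hinted at in the comments):

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    # Load a (hypothetical) audiobook chapter and cut it at pauses. The
    # silence parameters are guesses and would need tuning per recording.
    book = AudioSegment.from_file("audiobook_chapter01.mp3")
    chunks = split_on_silence(
        book,
        min_silence_len=700,            # a pause this long (ms) ends a chunk
        silence_thresh=book.dBFS - 16,  # 16 dB under the average is "silence"
        keep_silence=200,               # keep a little padding around chunks
    )

    for i, chunk in enumerate(chunks):
        chunk.export("chunk_%05d.wav" % i, format="wav")
        # Remaining work: align each chunk against the known source text
        # (e.g. with a forced aligner) and discard chunks where the match
        # is uncertain.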


Perhaps that's one of the points of using text from books: you can compare how people speak with how someone specifically tasked with reading the book aloud for the audiobook speaks.


That's correct. There isn't a good open dataset for that type of material. Perhaps we should try to build one.


Sadly, in the Demographic Data section, only native English accents can be selected.


I have reported this, and it looks like they intend to fix it: https://github.com/mozilla/voice-web/issues/242


If I read the issue right they don't intend to fix it at all.


Is the data going to be freely available as well? It's a little unclear whether they intend to make it separately available or not.


I agree that this should be documented better, but looking at their terms of service, the submitted recordings appear to be under CC0 [1].

The relevant blurb:

    Your Contributions and Release of Rights
    
    By submitting your recordings, you waive all copyrights and
    related rights that you may have in them, and you agree to
    release the recordings to the public under CC-0. This means
    that you agree to waive all rights to the recordings
    worldwide under copyright and database law, including moral
    and publicity rights and all related and neighboring rights.

[1] https://voice.mozilla.org/terms


It looks like the database will be open sourced later this year: https://voice.mozilla.org/faq

I'm wondering if the format will be easily translatable to the kinds of models that software like CMUSphinx and Julius use.

https://cmusphinx.github.io/

http://julius.osdn.jp/en_index.php


I would assume they are only going to be doing the raw data collection and maybe cleanup and annotation, and that data should be made available so you can train what you like.

If you poke around GitHub and the Kaldi lists a bit more, you can see that they are experimenting with, and probably planning to use, Kaldi.

I wonder what they plan to do for provisioning. It is one thing to collect data and train models, but quite another to make the service available over the web at unlimited capacity. And we are not yet at the point where you can reasonably expect to run a high-quality open-vocabulary STT system in your browser. The search network is typically in the gigabyte range.


It's weird to me that they announce the project as an open dataset of voice data, with only a promise to open it up later.


It's Mozilla; they likely just don't have the infrastructure/code in place yet. They definitely get my trust that they actually will. They have been very good in the past about keeping things open. They make mistakes sometimes, but their goals are all about being open.


Thanks for the vote of confidence! Yes, we will absolutely open this data up; it's just a matter of collecting enough data to be useful and then building the UI. We have a goal of achieving this by the end of 2017, so stay tuned!


Great feedback. We can look into clarifying on our homepage that our entire goal is to create a dataset in the public domain. We want people to donate not just to Mozilla, but to the world :)


Yes, the data will be made freely available.


Any plans for languages other than English?



And... 503'd. I didn't catch what the intended use case was before it died, but I'm guessing computer-generated voice?

Most of the computer-generated stuff I've seen uses trained actors, which neatly avoids the problem of trying to reconcile a myriad of accents and dialects - a problem that was immediately apparent from the first two samples I tried.

Edit: it's back up. It seems to be about voice recognition, which this could help with, no problem.


Actually, based on the site content I think they're using it to create an archive of speech data to train speech recognition systems.


That is correct. The DeepSpeech project (https://github.com/mozilla/DeepSpeech) will use this data to train and validate open source, freely available speech-to-text models. The training data, along with the trained models, will be made available for free to all users and researchers alike.


Site works fine from my location.

I think you are correct.

> Read a sentence to help our machine learn how real people speak. Check its work to help it improve. It’s that simple.


Sorry about the 503s! We were adding servers to our cluster to handle the Hacker News load, and a few 503s are hard to avoid. If this is consistently happening for you, please file a bug and we'll look at it.

https://github.com/mozilla/voice-web/issues


It would be useful to collect data from non-native speakers of a language. More and more such individuals are appearing in all countries, and devices that accept spoken words should not break because of someone's level of command of a spoken language. For example, a Swiss person speaking German (Hochdeutsch) or, more obviously, a Brit speaking French, etc. Some children who grow up in multilingual families also mix words from multiple languages into their sentences. We can still understand them.


This is a bug with our website [1]. We actually are trying to collect recordings from non-native speakers (as well as native ones). We are looking into clarifying this on the site.

1.) https://github.com/mozilla/voice-web/issues/242


I wonder if implementing a new type of reCAPTCHA with these types of projects in mind would make sense. The data wouldn't be going to some data center in Google land, but instead to some open endpoint that anyone should be able to get their hands on. Also, a free and open source reCAPTCHA alternative would be nice. The trick is keeping it complex enough that bots cannot just reuse the existing public dataset. Maybe withhold some of the data from the public for a few years until it's deemed 'retired'.


I hope this data will be used purely for voice recognition purposes and not for voice generation, or we'll be stuck with robots talking in this horrible gurgling and clicking accent due to the poor recording conditions of most participants!


According to their FAQ they actually want those poor conditions to be present in the corpus.

> We want the audio quality to reflect the audio quality a speech-to-text engine will see in the wild. Thus, we want variety. This teaches the speech-to-text engine to handle various situations—background talking, car noise, fan noise—without errors.

https://voice.mozilla.org/faq
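(Aside: Common Voice collects naturally noisy recordings rather than adding noise afterwards, but for a feel of what "fan noise in the background" means numerically, here is a purely illustrative numpy sketch that mixes a noise clip into a clean clip at a chosen signal-to-noise ratio:)

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Mix `noise` into `speech` at the requested signal-to-noise ratio (dB)."""
        noise = np.resize(noise, speech.shape)      # loop/trim noise to length
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        # Scale noise so speech_power / scaled_noise_power == 10 ** (snr_db / 10).
        scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
        return speech + scale * noise

    # Synthetic stand-ins for real clips (a tone for speech, hiss for a fan):
    speech = np.sin(np.linspace(0, 400 * np.pi, 16000)).astype(np.float32)
    fan_noise = (0.1 * np.random.randn(8000)).astype(np.float32)
    noisy = mix_at_snr(speech, fan_noise, snr_db=10.0)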


Cool project, really aligned with Mozilla's mission, and with a pleasant UX. And if you're a non-native English speaker like me, validating sentences is a nice way of improving your comprehension.


Yep! Though we non-native speakers should really be recording too, so we're not left behind in voice recognition.


Exactly! Part of the goal of Common Voice is to make voice recognition work better for people other than North American men, who currently provide the vast majority of the training data.

If you are a non-native speaker, we need your voice!


Okay, but what should non-native speakers submit for country and accent in the form? The only options are anglophone countries.


The accent you think you are speaking with (or are emulating).

https://github.com/mozilla/voice-web/issues/242


Good question. Sounds like we should add an "Other" option to that dropdown and make it clear that we are looking for all accents?


Yes - IMHO "non-native English speaker" would be the field I would look for first. (Edit: typed on a phone...)


Should I have any privacy concerns about contributing? I don't want just anyone to have the data to recreate my voice digitally.


Man, most people have horrible microphones.


Indeed, and that's a big problem for STT models, which is why we're trying to collect this type of data.


This is an important development; voice control has good potential. It would be cool if they used it as an alternative way to control Firefox and/or Servo.


Any idea why the duplicate detection did not work for this link: https://news.ycombinator.com/item?id=14786881

Anyhow: these should be merged (even though there is no discussion on the other submission)


I think that's why: when the previous thread is not very active, dupes are allowed. Not sure, though.


Yes. The system is designed to allow multiple chances for good content to get exposure.

https://hn.algolia.com/?query=dang%20deliberately%20porous&s...


Can't believe it's down already.



