I first learned about CMUSphinx from the [Jasper Project](https://jasperproject.github.io/). While Jasper provided an image for the Pi, I decided to go ahead and make a scripted install of CMUSphinx. I spent something like 2 frustrating days attempting to get it installed by hand in a repeatable fashion before giving up.
This was 2 years ago, so maybe it's simple now, but I didn't find it "amazingly easy" back then.
I do have a number of projects where I could definitely use a local speech recognition library. I have used [Python SpeechRecognition](https://github.com/Uberi/speech_recognition/blob/master/exam...) to essentially record and transcribe from a scanner. I wanted to take it further, but Google at the time limited the number of requests per day. Today's announcement seems to indicate they will be expanding their free usage, but a local setup would be much better; I'd like to deploy this in a place that might not have reliable Internet.
In my experience, the issues with building CMU Sphinx are mainly unspecified dependencies, undocumented version requirements, and forgetting to sacrifice the goat when the MSVC redistributable installer pops up.
We've written detailed, up-to-date instructions [1] for installing CMU Sphinx, and now also provide prebuilt binaries [2]!
If you're interested in not sending your audio to Google, CMU Sphinx and other libraries (like Kaldi and Julius) are definitely worth a second look.
Yeah I'm gonna leave a reply here just in case I need to find this again (already opened tabs, but you never know). This might be big for a stalled project at work. If this can un-stall that, I'll sure owe you a beer ;)
Would you mind submitting this documentation to CMU? I get the feeling they'd love to at least host a link to it to enhance their own documentation.
That sounds like my experience with it from about 5 years ago. I gave up on it too. It also didn't help that CMUSphinx has had more than one version in development, in different languages.
Unfortunately, the situation hasn't improved much. Besides, even if you get it set up, the quality of the recognition isn't even close to Google's.
As someone who's worked with a lot of these engines, Nuance and IBM are the only really high-quality players in the space. CMUSphinx and Julius are fine for low-volume applications where accuracy isn't critical, but in my experience neither comes close when you need high accuracy.
As someone who has actually done objective tests, Google are by far the best, Nuance are a clear second. IBM Watson is awful though. Actually the worst I've tested.
Do you have a report of your tests? I'm interested in using speech recognition, but there are so many start-ups and big players that it would be quite time-consuming to do a quality/price analysis myself.
For the "dialect" of Spanish that we speak in Argentina, Watson misses every single word. So, to me, CMUSphinx is valuable in that it lets me tweak it, while IBM fails miserably at every word. Must've been trained on "neutral" Spanish from Spain or Mexico.
Google's engine also works fine (I've been trying it on phones), but the pricing may or may not be a deal breaker.
Is Julius really state-of-the-art? It looks like they use n-grams and HMMs, the methods that achieved SotA 5+ years ago. My understanding is that Google and Microsoft are using end-to-end (or nearly end-to-end) neural network models, which outperformed the older methods a few years ago. Not sure how CMUSphinx works under the hood.
They might not be considered state-of-the-art (if you consider both approaches in the same category), but they are definitely one valid approach to voice recognition, which works surprisingly well.
CMUSphinx is not a neural-network-based system; it uses traditional acoustic and language modeling.
CMUSphinx is really easy to set up, and then being able to train it for one's specific domain probably beats state of the art with one-size-fits-all training.
> It is amazingly easy to create speech recognition without going out to any API these days.
Not really. The hard part is not the algorithm, it is the millions of samples of training data that have gone behind Google's system. They pretty much have every accent and way of speaking covered in their system which is what allows them to deliver such a high-accuracy speaker-independent system.
CMUSphinx is remarkable as an academic milestone, but in all honesty it's basically unusable from a product standpoint. If your speech recognition is only 95% accurate, you're going to have a lot of very unhappy users. Average Joes are used to things like microwave ovens, which work 99.99% of the time, and expect new technology to "just work".
CMUSphinx is also an old algorithm; AFAIK Google is neural-network based.
I think we will have an easy to install OSS speech recognition library and accurate pretrained networks not far off from Google/Alexa/Baidu, running locally rather than in the cloud, within 1-2 years. Can't wait.
http://cmusphinx.sourceforge.net/
http://julius.osdn.jp/en_index.php
It is amazingly easy to create speech recognition without going out to any API these days.