Google opens access to its speech recognition API

blennon · on March 23, 2016

This is HUGE in my opinion. Prior to this, in order to get near state-of-the-art speech recognition in your system/application you either had to have/hire expertise to build your own or pay Nuance a significant amount of money to use theirs. Nuance has always been a "big bad" company in my mind. If I recall correctly, they've sued many of their smaller competitors out of existence and only do expensive enterprise deals. I'm glad their near monopoly is coming to an end.

I think Google's API will usher in a lot of new innovative applications.

BorisEm · on March 23, 2016

Other "state-of-the-art" speech recognition solutions already exist. For example, Microsoft has been offering it through its Project Oxford service. https://www.projectoxford.ai/speech

hardwaresofton · on March 23, 2016

Also, CMUSphinx and Julius:

http://cmusphinx.sourceforge.net/

http://julius.osdn.jp/en_index.php

It is amazingly easy to create speech recognition without going out to any API these days.

ytjohn · on March 23, 2016

I first learned about CMUSphinx from the [Jasper Project](https://jasperproject.github.io/). While Jasper provided an image for the Pi, I decided to go ahead and make a scripted install of CMUSphinx. I spent something like 2 frustrating days attempting to get it installed by hand in a repeatable fashion before giving up.

This was 2 years ago, so maybe it's simple now, but I didn't find it "amazingly easy" back then.

I do have a number of projects where I could definitely use a local speech recognition library. I have used [Python SpeechRecognition](https://github.com/Uberi/speech_recognition/blob/master/exam...) to essentially record and transcribe from a scanner. I wanted to take it further, but google at the time limited the number of requests per day. Today's announcement seems to indicate they will be expanding their free usage, but a local setup would be much better. I'd like to deploy this in a place that might not have reliable Internet.

uberi · on March 24, 2016

In my experiences, the issues with building CMU Sphinx are mainly unspecified dependencies, undocumented version requirements, and forgetting to sacrifice the goat when the MSVC redistributable installer pops up.

We've written detailed, up-to-date instructions [1] for installing CMU Sphinx, and now also provide prebuilt binaries [2]!

If you're interested in not sending your audio to Google, CMU Sphinx and other libraries (like Kaldi and Julius), are definitely worth a second look.

[1] https://github.com/Uberi/speech_recognition/blob/master/refe... [2] https://github.com/Uberi/speech_recognition/tree/master/thir...

dr_zoidberg · on March 25, 2016

Yeah I'm gonna leave a reply here just in case I need to find this again (already opened tabs, but you never know). This might be big for a stalled project at work. If this can un-stall that, I'll sure owe you a beer ;)

hardwaresofton · on March 24, 2016

Would you mind submitting this documentation to CMU? I get the feeling they'd love to at least host a link to them or something to enhance their own documentation?

ytjohn · on March 28, 2016

Thanks for providing this. Will definitely give it a fresh look.

simcop2387 · on March 23, 2016

That sounds like my experience with it from about 5 years ago or so. I gave up on it also. It also didn't help that CMUSphinx has had more than one version in development in different languages.

hardwaresofton · on March 24, 2016

I would note that as a positive... But yeah, 5 years ago things were much much rougher (which is partly why I didn't think it got so much press).

But these days, if you go all the way through their tutorial, and give it a proper read, it's very doable to set up.

sebak · on March 24, 2016

Unfortunately, the situation hasn't improved much. Besides, even if you get it set up, the quality of the recognition isn't even close to the one from Google.

chetatkinsdiet · on March 24, 2016

As someone who's worked with a lot of these engines, Nuance and IBM are the only really high quality players in the space. CMUSphinx and Julius are fine for low volume operations where you don't need really accurate response rates, but if you want high accuracy neither comes close from my experience.

hardwaresofton · on March 24, 2016

Right, but they do offer you a fantastic starting point. If Nuance is 100%, I'd say CMUSphinx is at least 40%.

Also, they give you the tools and knowledge to build better models (and explain the theory), which is where most of the competitive advantage is IMHO.

IshKebab · on March 24, 2016

As someone who has actually done objective tests, Google are by far the best, Nuance are a clear second. IBM Watson is awful though. Actually the worst I've tested.

Pafnouti · on March 24, 2016

Do you have a report of your tests? I'm interested in using speech recognition, but there are many start-ups and big players that it would be quite time consuming to get a quality/price analysis.

dr_zoidberg · on March 25, 2016

For the "dialect" of spanish that we speak in Argentina, Watson misses every single word. So, to me, CMUSphinx is valuable in that it allows me to tweak it, while IBM miserably fails at every word. Must've been trained with Spain or Mexican "neutral" spanish.

Googles engine also works fine (have been trying it with the phones), but the pricing may or may not be a deal breaker.

blennon · on March 23, 2016

Is Julius really state-of-the-art? Looks like they use n-gram and HMMs. Those were the methods that achieve SotA 5+ years ago. My understanding is that Google and Microsoft are using end-to-end (or nearly) neural network models; these outperformed the older methods a few years ago. Not sure how CMUSphinx works under the hood.

hardwaresofton · on March 24, 2016

They might not be considered state-of-the-art (if you consider both approaches in the same category), but they are definitely one valid approach to voice recognition, which works surprisingly well.

CMUSphinx is not a neural network based system, they do use language and acoustic modeling.

dharma1 · on March 24, 2016

check out https://github.com/yajiemiao/eesen for LSTM and CTC based library instead of HMMs

FranOntanaya · on March 23, 2016

CMUSphinx is really easy to set up, and then being able to train it for one's specific domain probably beats state of the art with one-size-fits-all training.

dheera · on March 24, 2016

> It is amazingly easy to create speech recognition without going out to any API these days.

Not really. The hard part is not the algorithm, it is the millions of samples of training data that have gone behind Google's system. They pretty much have every accent and way of speaking covered in their system which is what allows them to deliver such a high-accuracy speaker-independent system.

CMUSphinx is remarkable as an academic milestone, but in all honesty it's basically unusuable from a product standpoint. If your speech recognition is only 95% accurate, you're going to have a lot of very unhappy users. Average Joes are used to things like microwave ovens, which work 99.99% of the time, and expect new technology to "just work".

CMUSphinx is also an old algorithm; AFAIK Google is neural-network based.

dharma1 · on March 24, 2016

Eesen looks promising, uses LSTM and CTC rather than older tech.

https://github.com/yajiemiao/eesen

Baidu open sourced their CTC implementation

https://github.com/baidu-research/warp-ctc

I think we will have an easy to install OSS speech recognition library and accurate pretrained networks not far off from Google/Alexa/Baidu, running locally rather than in the cloud, within 1-2 years. Can't wait.

hackuser · on March 23, 2016

From the Microsoft Project Oxford Speech API link:

Speech Intent Recognition

... the server returns structured information about the incoming speech so that apps can easily parse the intent of the speaker, and subsequently drive further action. Models trained by the Project Oxford LUIS service are used to generate the intent.

Do others offer something like this?

jjwiseman · on March 24, 2016

Microsoft LUIS is almost identical to the intent classification and entity extraction in the Alexa Skills Kit, but it's easier to use because you can pipe in your own text from any source instead of having to use a specific speech recognition engine. LUIS also has a pretty nice web interface that prompts you to label utterances that it's seen that it had trouble with.

IshKebab · on March 24, 2016

But Google's is by far the best.

bossx · on March 24, 2016

Exactly, this is what people are missing when they are trying to compare Google's speech recognition to other services. Google uses deep neural-networks to continuously train and improve the quality of their speech recognition, they get their training data from the hundreds of millions of Android users around the world using speech-to-text every day. No other company has a comparable amount of training data, continuously being expanded. http://googleresearch.blogspot.ca/2015/09/google-voice-searc...

_r5wf · on March 24, 2016

Kaldi is probably the best option now. https://github.com/kaldi-asr/kaldi

ericflo · on March 23, 2016

Interesting - I saw this as a defensive response to the rising number of developers using Amazon's Alexa APIs, rather than anything related to Nuance.

dharma1 · on March 24, 2016

It's probably been on their roadmap for a while, before Alexa came out. Re: Alexa/echo - think there is an opportunity for someone to manufacture cheap USB array mics for far field capture.

Still, having this paid and cloud based puts a limit to types of things where you'd use it. I will use it in my own apps for now but will swap to a OSS speech recognition library running locally as soon as one emerges that is good enough.

F_J_H · on March 24, 2016

You're right - this could lead to a lot of new innovation as a bunch of developers who wouldn't have bothered before can now start hacking away to see what they can do.

I've been thinking a lot lately about where the next major areas of technology-driven disruption might be in terms of employment impact, and things like this make me wonder how long it will be before call centers stacked wall to wall with customer service reps become a relic of the past...

idiot99 · on March 23, 2016

If it's anything like Googles other APIs, people will build applications on top of it, and then Google will decide to shut down the API with no notice.

Fun to play with, but don't expect it to last...

boulos · on March 24, 2016

That's incorrect. This is a Google Cloud Platform service and when it reaches General Availability (GA) it will be subject to our Deprecation Policy. Just like Compute Engine, Cloud Storage, etc. requires us to give at least a 1 year heads up.

Disclosure: I work on Compute Engine.

_pctq · on March 24, 2016

It's nice there's a policy around that, but I can understand the fears of someone considering using this to start a product - or even worse, a business.

Google has an history of shutting down useful products, why people should trust that one for long term integration?

boulos · on March 24, 2016

Because we don't have a history of violating promises like this when done in writing? Seriously, I'd love to call us just "Cloud Platform" so you don't have to think "oh yeah, those guys cancelled reader on me" but if you look at the Cloud products we don't play games with this (partly because we hold ourselves to our binding Deprecation Policy, but mostly because we really care).

zelcon · on March 24, 2016

Google Search API, Autocomplete, Finance, Voice all closed with tons of active users. I'm not blaming Google; they were acting in their best interest, but the consequence is less enthusiasm for building software that depends on their APIs.

IMO a better option for Google, when considering to close an API, is to enforce payment and hike the price enough to justify maintaining it. If and only if enough users drop out with the higher price, then shut it down for good.

dragonwriter · on March 24, 2016

Even outside of services with a formal deprecation policy, Google rarely shuts anything down with no notice (their frequently cited shutdowns had long notice.)

jamesdwilson · on March 23, 2016

Has Google sued many of their smaller competitors out of business?

TuringNYC · on March 24, 2016

No, and they dont need to. They have too many other advantages -- low customer acquisition cost via already-present cloud customers, economies of scale, ease of hiring the best talent, natural integrations via their android platform...who needs nastiness when you have all these amazing benefits!

grahamburger · on March 23, 2016

Isn't Google's strategy typically to just start offering the same services as their smaller competitors but for free and then let them starve? ... Kind of like what's probably happening here? Sounds like this is terrible news for Nuance, for example.

nl · on March 23, 2016

CaveTech · on March 23, 2016

> To attract developers, the app will be free at launch with pricing to be introduced at a later date.

Doesn't this mean you could spend time developing and building on the platform without knowing if your application is economically feasible? Seems like a huge risk to take for anything other than a hobby project.

jonknee · on March 23, 2016

Only if you spend a huge amount of time integrating with their API. It won't be more expensive than Nuance, so prepare with their pricing in mind and you'll be fine ($.008 / transaction over 20k monthly transactions).

https://developer.nuance.com/public/index.php?task=memberSer...

Spooky23 · on March 23, 2016

If you do any kind of volume with Nuance you get get pretty significant discounts.

On a past project, i think we got 45% off list without too much trouble.

sologoub · on March 23, 2016

I was really hoping they would post at least a preview of what the pricing will be.

AppEngine pricing change resulted in a nasty surprise for a lot of people and a lot of vitriol towards Google. Just one article on this from many: http://readwrite.com/2011/09/02/google-app-engine-pricing-an...

While the platform has moved way ahead of where it was in 2011, the memory of this is still in the back of people's mind and it would be a good move on Google's part to at least try to engender more trust by giving directional pricing.

mianos · on March 23, 2016

It is funny reading this in hindsight. We went straight to AWS, also for long running tcp connections, and never looked at it again. After that I have never recommended app engine.

bduerst · on March 23, 2016

Wasn't that five years ago? When it came out of preview (and prices were expected to change)?

sologoub · on March 23, 2016

It was - 2011 is mentioned in both article and my comment.

I just wish Google wouldn't bring back memories of that by not disclosing pricing of a very promising API.

GAE has moved quite far ahead since then, but many people still won't consider it after the bad experience. Perceptions die hard...

dragonwriter · on March 24, 2016

They probably are using usage data in the free period to drive decisions on pricing.

sologoub · on March 24, 2016

Sure, would expect nothing less, but I'm also sure they have some sense of the range they are looking for. Such a range could help developers to establish whether their use case is viable or not.

As is, the ambiguity is a deterrent from too much time investment.

bduerst · on March 25, 2016

Or the data determines that they have to price outside the hypothetical early-release range, and then you have people complaining that you lied.

KB1JWQ · on March 23, 2016

They did something very similar to this for their translate API. Realtime IRC client plugins that translated lines in foreign languages into yours in near-realtime went from "awesome" to "just kidding, nevermind" virtually overnight.

jaredhansen · on March 23, 2016

If (for any reasonably probable value of final pricing) the cost of voice recognition is a material factor in determining whether the application is economically feasible, you have larger problems. In other words: if your application's economics are going to swing materially depending on how this is priced, it's probably not going to work out even if they keep it free.

jimrandomh · on March 23, 2016

Apps can involve users speaking for ten seconds making selections, or an hour getting transcribed. Right now, we don't know the order of magnitude; adding voice interaction to an app could be $.01/user/year or it could be $100/user/year.

jjwiseman · on March 24, 2016

It depends. Imagine a hardware device that uses the speech API, and you're looking at the 3 year cost of an average of 5 speech interactions per day (something like the Amazon Echo). Using Nuance pricing of $0.008/request, you're adding 365 * 3 * 5 * $.008 = $43.80 in cost above the BOM cost; It's entirely possible that adding $20 to the BOM is viable, but adding $40 is not.

AnkhMorporkian · on March 23, 2016

That may just be poor phrasing by TechCrunch. It's entirely possible Google will announce the pricing on launch day, but not charge for some period of time so people have the ability to develop the initial apps without paying.

dharma1 · on March 24, 2016

Work in the pricing now (assuming it's the same as competition) and architecture your app so it's easy to replace the provider down the line if you need to, without breaking the clients.

zkirill · on March 23, 2016

I came across CMU Sphnix speech recognition library (http://cmusphinx.sourceforge.net) that has a BSD-style license and they just released a big update last month. It supports embedded and remote speech recognition. Could be a nice alternative for someone who may not need all of the bells and whistles and prefers to have more control rather than relying on an API which may not be free for long.

Side note: if anyone is interested in helping with an embedded voice recognition project please ping me.

mark_l_watson · on March 23, 2016

There is a project for Rapberry Pi to use Sphinx to roll your own Amazon Echo like device. You might want to take a look at that.

depingus · on March 23, 2016

Mycroft https://mycroft.ai/

mark_l_watson · on March 24, 2016

Thanks for the link, but this is the Raspberry Pi project: https://github.com/jasperproject

IshKebab · on March 24, 2016

Sphinx is an order of magnitude worse than Google. They're not in the same league.

hardwaresofton · on March 23, 2016

Came in here to say this (and also include julius), but yeah cmusphinx is awesome.

zkirill · on March 23, 2016

I haven't seen Julius before, thank you! How do you decide which one to use?

hardwaresofton · on March 24, 2016

The last time I did interesting stuff with these libs, I used CMUSphinx -- The documentation for Julius wasn't quite good enough yet, and CMUSphinx has great documentation.

Sphinx is supported by Carnegie Mellon and Julius by Kyoto University/Nagoya Institute of Technology.

I think the easier choice even today might still be Sphinx. Given the excellent documentation (they touch pretty much all the basics you need to know), and the availability of pocketsphinx (C) and Sphinx4 (Java).

There's also projects like this: https://github.com/syl22-00/pocketsphinx.js

TorKlingberg · on March 24, 2016

It's been years since I used CMU Sphinx, but don't you have to bring your own training data? Sure, there are free data sets out there, and pre-trained models, but they are not as good as what Google et al. have.

dharma1 · on March 24, 2016

Yes. And it's not just the data sets, its the fundamental technology. You really need state-of-the-art, ie LSTM/CTC to deal with noisy input data and to get to 99% accuracy (in addition to excellent data sets of course)

cheapsteak · on March 24, 2016

Tried it for a project, decent for english, but the support for non-english languages is not there

zkirill · on March 24, 2016

The website says they provide language models for "many languages".

When did you try it and which languages? Any particular issues you can share?

cheapsteak · on March 24, 2016

9 months ago for Portuguese, the amount of effort required for building a language model is too great.

Voice recognition was supposed to have been a cherry-on-top, but ended up taking out one of our senior developers for the duration of the month-long project, and we were ultimately unable to get it working in the time that we had available

A glimpse of what's involved: http://cmusphinx.sourceforge.net/wiki/tutoriallm

hardik988 · on March 24, 2016

Tangentially related: Does anyone remember the name of this startup/service that was on HN (I believe), that enables you to infer actions from plaintext.

Eg: "Switch on the lights" becomes

{"action": "switch_on", "thing" : "lights" }

etc.. I'm trying really hard to remember the name but it escapes me.

Speech recognition and <above service> will go very well together.

ar7hur · on March 24, 2016

Our service Wit.ai (YC W14) does just that.

Demo: https://labs.wit.ai/demo/index.html

hardik988 · on March 24, 2016

This is it! Thank you!

adyus · on March 24, 2016

I've been meaning to ask someone at Wit.ai this for a while:

Since your service is completely free, how do you plan on surviving? Would you open source any parts of Wit.ai should you go under?

I feel these are important questions to ask before investing time & energy into using your otherwise awesome service...

ragebol · on March 24, 2016

They've been acquired by Facebook, the main guys behind are now making Facebook M.

JustFinishedBSG · on March 24, 2016

> They've been acquired by Facebook

Ah so it will probably be discontinued

kuschku · on March 24, 2016

Which languages does your service support?

hk__2 · on March 24, 2016

From their website: de, en, es, et, fr, it, nl, pl, pt, ru, sv.

hardwaresofton · on March 23, 2016

In case you're not interested in having google run your speech recognition:

CMU Sphinx: http://cmusphinx.sourceforge.net/

Julius: http://julius.osdn.jp/en_index.php

melvinmt · on March 24, 2016

If you're having trouble (like me) to find your "Google Cloud Platform user account ID" to sign up for Limited Preview access, it's just the email address for your Google Cloud account. Took me only 40 minutes to figure that one out.

josephcooney · on March 24, 2016

I wrote a client library for this in C# by reverse engineering what chrome did at the time (totally not legit/unsupported by google, possibly against their TOS). I have never used it for anything serious, and am glad now there is an endorsed way to do this.

https://bitbucket.org/josephcooney/cloudspeech

theseatoms · on March 23, 2016

Key sentence:

> The Google Cloud Speech API, which will cover over 80 languages and will work with any application in real-time streaming or batch mode, will offer full set of APIs for applications to “see, hear and translate,” Google says.

jaflo · on March 24, 2016

Pretty impressive from the limited look the website (https://cloud.google.com/speech/) gives: the fact that Google will clean the audio of background noise for you and supports streamed input is particularly interesting.

I don't know I should feel about Google taking even more data from me (and other users). How would integrating this service work legally? Would you need to alert users that Google will keep their recordings on file (probably indefinitely and without being able to delete them)?

robohamburger · on March 23, 2016

Unless I have gone crazy google has had a STT available to tinker with for awhile. It is one of the options for jasper [1]. Hopefully this means it will be easier to setup now.

Would be nice if they just open sourced it though but I imagine that is at crossed purposes with their business.

[1] https://jasperproject.github.io/documentation/configuration/

jonah · on March 24, 2016

SoundHound released Houndify[1], their voice API last year which goes deeper than just speech recognition to include Speech-to-Meaning, Context and Follow-up, and Complex and Compound Queries. It will be cool to see what people will do with speech interfaces in the near future.

[1] https://www.houndify.com/

amelius · on March 23, 2016

Why isn't speech recognition just part of the OS? Like keyboard and mouse input.

christianmann · on March 23, 2016

Because speech recognition occurs on Google servers, not locally.

cryptoz · on March 23, 2016

Speech recognition can occur either locally or on Google's servers. Since about 2012 [1], Android has been able to do some types of speech recognition, like dictation, on local devices. Additionally, Google Research has recently expanded on this functionality and it seems like much more of the speech recognition will be done locally [2].

[1] http://www.androidcentral.com/jelly-bean-brings-offline-voic...

[2] http://www.zdnet.com/article/always-on-google-ai-gives-andro...

amelius · on March 23, 2016

> always-on-google-ai-gives-android-voice-recognition-that-works-on-or-offline

Since Android is open-source, would that mean that the voice recognition software (and/or trained coefficients) could, in principle, be ported to Linux?

Artur_Gaspar · on March 23, 2016

"Android" as it is distributed in mobile devices has some Google proprietary components.

kuschku · on March 24, 2016

> since Android is Open Source

You haven't used Android since 2010, have you?

In the latest versions, there is no more Open Source anything.

Calendar, Contacts, Home screen, Phone app, Search, are all closed source now.

(btw, all of them, including the Google app, used to be open in Gingerbread)

You can't do TLS without going through Google apps (or packaging spongycastle), you can't do OpenGL ES 3.2, you can't use location anymore, nor use WiFi for your own location implementation.

Since Marshmallow, you are also forced to use Google Cloud Messaging, or the device will just prevent your app from receiving notifications.

To "save battery power" and "improve usability", Google monopolized all of Android.

kuschku · on March 24, 2016

To whomever downvoted the above post: Please clarify why you think it isn’t relevant to the discussion, or provide counterarguments. All the points I made can be easily sourced (if you wish, I can even post them in here), and are all verifiable.

amelius · on March 24, 2016

It could also have been the tone of your first sentence.

Not that I mind personally.

kuschku · on March 24, 2016

It wasn’t meant aggressively, just as a question. It’s quite possible that the author of the comment I answered to had not used Android for a few years, or had never cared – or had just missed the announcements of the official apps not being supported anymore.

Oh, wait, there were no announcements, they were dropped silently.

zo1 · on March 24, 2016

It used to be. I remember "back in the day", windows 95/98 had it. It wasn't perfect, but it was decent with a bit of training such that you could dictate reasonably well.

mobiledev88 · on March 24, 2016

Houndify launched last year and provides both speech recognition and natural language understanding. They have a free plan that never expires and transparent pricing. It can handle very complex queries that Google can't.

timbunce · on March 24, 2016

FWIW I'd just finished a large blog post researching ways to automate podcast transcription and subsequent NLP.

It includes lots of links to relevant research, tools, and services. Also includes discussion of the pros and cons of various services (Google/MS/Nuance/IBM/Vocapia etc.) and the value of vocabulary uploads and speaker profiles.

http://blog.timbunce.org/2016/03/22/semi-automated-podcast-t...

metastew · on March 27, 2016

As a hard of hearing aspiring software developer, this would be a godsend for me if someone came up with a reliable automated transcription service. I'm often dismayed by the amount of valuable information locked in podcasts and non-transcribed videos and have to rely on goodwill of volunteers to give me transcripts.

Pycon did a admirable effort to live caption their talks last year but some of those transcripts never got uploaded along with the talks which is puzzling, but I suppose it could be due to lack of timecodes.

I've subbed to your blog and hopefully I can contribute whatever I can to make this work out.

vram22 · on March 23, 2016

For anyone who wants to try these areas a bit:

My trial of a Python speech library on Windows:

Speech recognition with the Python "speech" module:

http://jugad2.blogspot.in/2014/03/speech-recognition-with-py...

and also the opposite:

http://code.activestate.com/recipes/578839-python-text-to-sp...

danso · on March 23, 2016

FWIW, Google followed the same strategy with Cloud Vision (iirc)..they released it in closed beta for a couple of months [0], then made it generally available with a pricing structure [1].

I've never used Nuance but I've played around with IBM Watson [2], which gives you 1000 free minutes a month, and then 2 cents a minute afterwards. Watson allows you to upload audio in 100MB chunks (or is it 10 minute chunks?, I forgot), whereas Google currently allows 2 minutes per request (edit: according to their signup page [5])...but both Watson and Google allow streaming so that's probably a non-issue for most developers.

From my non-scientific observation...Watson does pretty well, such that I would consider using it for quick, first-pass transcription...it even gets a surprising number of proper nouns correctly including "ProPublica" and "Ken Auletta" -- though fudges things in other cases...its vocab does not include "Theranos", which is variously transcribed as "their in house" and "their nose" [3]

It transcribed the "Trump Steaks" commercial nearly perfect...even getting the homophones in "when it comes to great steaks I just raise the stakes the sharper image is one of my favorite stores with fantastic products of all kinds that's why I'm thrilled they agree with me trump steaks are the world's greatest steaks and I mean that in every sense of the word and the sharper image is the only store where you can buy them"...though later on, it messed up "steak/stake" [4]

It didn't do as great a job on this Trump "Live Free or Die" commercial, possibly because of the booming theme music...I actually did a spot check with Google's API on this and while Watson didn't get "New Hampshire" at the beginning, Google did [4]. Judging by how well YouTube manages to caption videos of all sorts, I would say that Google probably has a strong lead in overall accuracy when it comes to audio in the wild, just based on the data it processes.

edit: fixed the Trump steaks transcription...Watson transcribed the first sentence correctly, but not the other "steaks"

[0] http://www.businessinsider.com/google-offers-computer-vision...

[1] http://9to5google.com/2016/02/18/cloud-vision-api-beta-prici...

[2] https://github.com/dannguyen/watson-word-watcher

[3] https://gist.github.com/dannguyen/71d49ff62e9f9eb51ac6

[4] https://www.youtube.com/watch?v=EYRzpWiluGw

[5] https://services.google.com/fb/forms/speech-api-alpha/

nfriedly · on March 24, 2016

It's 100MB chunks, and you can compress it with opus or flac to squeeze in more audio per chunk :)

ocdtrekkie · on March 23, 2016

"Google may choose to raise those prices over time, after it becomes the dominant player in the industry."

...Isn't that specifically what anticompetition laws were written to prevent?

KyleBrandt · on March 23, 2016

As a developer I might be more worried about it not becoming at least amongst the dominant players because they might just drop it.

But maybe they only do that with consumer facing items?

ocdtrekkie · on March 23, 2016

Google kills off APIs often. Remember the whole Translate API fiasco? Though that was for overuse, not underuse.

skj · on March 24, 2016

This is a Google Cloud Platform service, and is subject to long deprecation policies at the very least.

j1vms · on March 23, 2016

I would say that Google's main goal here is in expanding their training data set, as opposed to creating a new revenue stream. If it hurts competitors (e.g. Nuance) that might only be a side-effect of that main objective, and likely they will not aim to hurt the competition intentionally.

As others here have pointed out, the value now for GOOG is in building the best training data-set in the business, as opposed to just racing to find the best algorithm.

jrv · on March 24, 2016

Question from a machine learning noob: how would they use an unlabeled dataset for training? Would Google employees listen to it all (if the ToS would even allow that) and transscribe it, then use the result for training? Or is there another way to make it useful without being labeled?

gradys · on March 24, 2016

I doubt that using their customers' audio as training data was a major motivation for offering this service.

But, assuming that was their plan, they'd have a couple options:

- Like you said, they could turn it into supervised training examples by transcribing it. I'm sure they'd at least like to transcribe some of it so that they can measure their performance. Also, while Google does have a lot of 1st party applications feeding them training data, customer data might help them fill in some gaps.

- They might also be able get some value out of it without transcribing it. Neural networks can sometimes be pre-trained in an unsupervised manner. One example would be pre-training the network as an autoencoder, which just means training it to reproduce its input as its output. This can reduce convergence time.

goodplay · on March 24, 2016

Couldn't they go the the same route they took with classifying objects in images by using their users to label them? I know recaptcha provides an option to verify yourself by transcribing audio.

TuringNYC · on March 24, 2016

https://en.wikipedia.org/wiki/Unsupervised_learning

jrv · on March 24, 2016

While aware of unsupervised learning, I thought I had read somewhere that it wasn't yet a solved problem to use it for this kind of learning. But that might be wrong, and the Wikipedia article mentions some interesting examples.

IanCal · on March 24, 2016

Quite a few methods of running recognition work in two stages (as a very large simplification).

First, you take the huge input (because something like sound has a huge amount of data in it, similarly with images there are a lot of pixels) and learn a simpler representation of it.

The second problem of mapping these nice dense features to actual things can be solved in different ways, even simple classifiers can perform well.

This doesn't actually need any labelled data. I just want to learn a smaller representation. For example, if we managed to learn a mapping from bits of audio to the phonetic alphabet then our speech recognition problem becomes one of just learning the mapping from the phonetic alphabet to words which is a far nicer problem to have.

Some ways of "deep learning" solve this first problem (of learning neater representations) through a step by step process of what I like to refer to as laziness.

Instead of trying to learn a really, really high level representation of your input data just learn a slightly smaller one. That's one layer. Then once we've got that we try and learn a smaller/denser representation on top of that. Then again, and again, and again.

How can you learn a smaller representation? Well a good way is to try and get a single layer to be able to regenerate its input. "Push" the input up, get the activations in the next layer, run the whole thing backwards and see how different your input is. You can then use this information to tweak the weights to make it slightly better the next time. Do this for millions and millions of inputs and it gets pretty good. This technique has been known about for a long time, but one of the triggers for the current big explosion of use was Hinton working out that this back and forth only really needs to be done once rather than 100 times (which was thought to be required beforehand).

Hinton says it made things 100,000 times faster because it was 1% of the computation required and it took him 17 years to realise it in which time computers got 1000 times faster. Along with this, GPUs got really really fast and easier to program. I took the original Hinton work that took weeks to train and had it running in hours back in 2008 on a cheap GPU. So before ~2006 this technique would have taken years of computer time, now it's down to minutes. Of course, that's then resulted in people building significantly larger networks that take much longer to train but would have been infeasible to run before.

But you still need a lot of unlabelled data. While I doubt google is doing that with this setup, they have done something before, where they setup a question answering service in the US that people could call I think for free to collect voice data.

TL;DR

You need labelled data. But it turns out you can learn most of what you need with unlabelled data, leaving you with a much simpler problem to solve. That's great because labelled data is massively more expensive than unlabelled data.

murjinsee · on March 24, 2016

All of your audio is recorded (google.com/dashboard) then the audio files are thrown into a lottery. Individual people transcribe what they hear for the computer to build associations.

Or, at least, that's my best guess with zero research and little knowledge.

somerandomness · on March 24, 2016

Google gets plenty of speech training data from Android phones; I doubt they need more from startups.

kuschku · on March 24, 2016

And it opens up legal questions

Like if neural networks trained with user data should be un-copyrightable, and public domain by default.

zkhalique · on March 23, 2016

Has anyone tried adding OpenEars to their app, to prevent having to send things over the internet from e.g. a basement? Is it any good at recognizing basic speech?

alexcaps · on March 23, 2016

Pretty bad except for keyword spotting.

szimek · on March 23, 2016

In the sign-up form they state that "Note that each audio request is limited to 2 minutes in length." Does anyone know what "audio request" is? Does it mean that it's limited to 2 minutes when doing real-time recognition, or just that longer periods will count as more "audio requests" and result in a higher bill?

Do they provide a way to send audio via WebRTC or WebSocket from a browser?

amelius · on March 23, 2016

Nice. But what I want is open-source speech recognition.

nostrademons · on March 23, 2016

Like most machine-learning applications, the source code isn't the interesting part, the data is. Google started by training on millions of phrases from Google 411, and then they've been able to continue training anytime someone issues a voice command to an Android device. They have orders of magnitude more data than you could fit into a GitHub repository.

tomp · on March 23, 2016

Couldn't you just download subtitles for old movies and train using those?

danso · on March 23, 2016

There aren't many old movies (in the public domain) that are subtitled and/or have good audio... However, in the U.S., we have a huge amount of government audio/video that is captioned. Here's an example of using youtube-dl to download one of President Obama's speeches and extract the closed captions (which I believe are produced by the government) into its own VTT file:

https://asciinema.org/a/93ihv5er83mrpihh9i5m3s9o5

Congress has video for all of its sessions and it is transcribed. So does the Supreme Court (though not timestamped).

oh_sigh · on March 23, 2016

Doesn't even need to be old movies. Certain types of video content in the US is legally required to have subtitles(e.g. a lot of youtube content). You could programmatically download them and use that as your training set. And, since it is a transformative work, you can train your models even on copyrighted works freely.

nostrademons · on March 23, 2016

Much of the YouTube content has auto-generated subtitles, i.e. Google is running their speech-recognition software on the audio stream and then using that to caption the video. If you used that as your training set, you're effectively training on the output of an AI. Which is kind of a clever way to get information from Google to your open-source library, but will necessarily be lower-fidelity than just using the Google API directly.

_puk · on March 24, 2016

In the US, if it's ever been played out on broadcast TV then it must have Closed Captions.

This is enforced by the FCC [0], but as more and more "internet" content gets consumed I imagine the same regulations will eventually come, at which point you've got a fantastic training set.

0: https://www.fcc.gov/node/23883

nostrademons · on March 23, 2016

That's actually a pretty good idea. Lyrics for rap music might also be a good training data set. It'd bias strongly toward English, though, and particularly American English. I suspect the size of the resulting data set is also quite a bit smaller than what Google has.

Teodolfo · on March 24, 2016

Kaldi + voxforge http://kaldi.sourceforge.net/about.html http://www.voxforge.org/

jjwiseman · on March 24, 2016

This paper from 2014 is a comparison of 5 different open source speech recognizers: http://suendermann.com/su/pdf/oasis2014.pdf

The Word Error Rates (lower is better) for each recognizer on two different corpora, VM1 and WSJ1:

  RECOGNIZER        VM1  WSJ1
  HDecode v3.4.1    22.9 19.8
  Julius v4.3       27.2 23.1
  pocketsphinx v0.8 23.9 21.4
  Sphinx-4          26.9 22.7
  Kaldi             12.7 6.5

z3t4 · on March 24, 2016

At least offer a self hosted version. Maybe it's just me, but I'm not comfortable sending every spoken word to Google.

kuschku · on March 24, 2016

Exactly this.

Especially because Google's version is trained with illegally obtained user data (no, changing your ToS doesn't allow you to use previously collected data for new purposes in the EU).

We, as a society, should discuss if software trained on user data should be required to be available to those who have provided that data. If for software developed by training neural networks even any copyright can exist — or if it's by definition public domain.

Software trained with neural networks provides an immense advantage for existing monopolies, and makes it extremely hard for competitors.

If this trend continues, startups will become impossible for software that depends on being trained with huge datasets.

yeukhon · on March 23, 2016

I thought I read open source, then I realized open access. I believe in the past there was a similar API, or maybe it was based on Google Translate. But I swear at one point people wrote hackathon projects using some voice APIs.

dominotw · on March 24, 2016

Nice! Curious how it compares to amazon's avs that went public this week.

https://github.com/amzn/alexa-avs-raspberry-pi

saurik · on March 23, 2016

I think this more directly competes with the IBM Watson speech API, not Nuance?

jonknee · on March 23, 2016

Why? This is almost exactly what Nuance provides:

https://developer.nuance.com/public/index.php?task=memberSer...

swah · on March 24, 2016

I wanted to see if there was good speech recognition and could Google about Nuance... Now that its going to be "disrupted" I found out about it.. ironic.

Try Googling "speech recognition api"...

Spooky23 · on March 23, 2016

IBM's offering is using the nuance engine, with Watson using its magic to make more accurate predictions based on context.

chetatkinsdiet · on March 24, 2016

IBM split from Nuance a year or so ago- it's now its own engine.

raverbashing · on March 23, 2016

Not sure how many users of IBM speech API but Nuance is big in this area (also offers offline conversion products)

bduerst · on March 23, 2016

Well NUAN took a 5% price hit after the announcement today, so maybe there's some overlap?

Negative1 · on March 24, 2016

I would be hesitant to build an entire application that relied on this API only to have it removed in a few months or years when Google realizes it sucks up time and resources and makes them no money.

visarga · on March 24, 2016

It seems there are a few options. Devs could use an abstraction layer and change providers with very little effort. Also, by the way things seem to be going, voice will become more and more prevalent.

hans · on March 23, 2016

cool, next up is a way to tweak the speech API to recognize patterns in stocks and capex .. wasn't that what Renaissance Technologies did ?

really GooG should democratize quant stuff next .. diy hedge fund algos.

alfonsodev · on March 23, 2016

I'm reading many libraries here, I wonder what's the best open and multi platform software for spech recognition to code with vim, Atom etc. I only saw a hybrid system working with dragon + Python on Windows. I would like to train/ customize my own system since I'm starting to have pain in tendons, and wrists. Do you think this Google Api can make it? Not being local looks like a limiting factor for speed, lag.

lovelearning · on March 24, 2016

Nothing out of the box as far as I know. You'll have to DIY.

Have a look at CMUSphinx/Pocketsphinx [1]. I wrote a comment about training it for command recognition in a previous discussion[2].

It supports BNF grammar based training too [3], so I've a vague idea that it may be possible to use your programmming language's BNF to make it recognize language tokens. I haven't tried this out though.

Either way, be prepared to spend some time on the training. That's the hardest part with sphinx.

Also, have you seen this talk for doing it on Unix/Mac [4]? He does use natlink and dragonfly, but perhaps some concepts can be transferred to sphinx too?

[1]: http://cmusphinx.sourceforge.net/ [2]: https://news.ycombinator.com/item?id=11174762 [3]: http://cmusphinx.sourceforge.net/doc/sphinx4/edu/cmu/sphinx/... [4]: https://www.youtube.com/watch?v=8SkdfdXWYaI

dimatter · on March 24, 2016

http://voicecode.io

zelcon · on March 24, 2016

Great, now when will Google let us use the OCR engine they crowdsourced from us over the last decade with ReCaptcha. tesseract is mediocre.

scotth · on March 24, 2016

https://cloud.google.com/vision/

willwill100 · on March 23, 2016

Will be interesting to compare with http://www.speechmatics.com

spdustin · on March 23, 2016

There's no hint of pricing that I could find on their page. Do you know about their pricing model?

willwill100 · on March 24, 2016

If you sign up you can scroll down and see the pricing on https://www.speechmatics.com/account/

chairlaw · on March 23, 2016

What is the difference from a speech recognition API and [NLP libraries](https://opennlp.apache.org/)? This information was not easily found with a few google searches, so I figured others might have the same question.

jbkkd · on March 23, 2016

NLP libraries process written text. This new API processes speech and extracts text from it.

infocollector · on March 24, 2016

What is the best speech recognition engine, assuming one has no internet?

sounds · on March 24, 2016

At this time, CMU Sphinx or Julius are good choices, not great but worth a look.

vincent_s · on March 24, 2016

Don't get too excited: https://www.google.com/search?q=google+shuts+down+api

flanbiscuit · on March 23, 2016

I hope this opens up some new app possibilities for the Pebble Time. I believe right now they use Nuance and it's very limited to only responding to texts.

mysticmode · on March 24, 2016

I'm not sure, what will happen to Google's webspeech API in the future. Whether it will be continued as a free service.

mark_l_watson · on March 23, 2016

I think they are pushing back against Amazon's Echo speech APIs, which I have experimented with.

I just applied for early access.

omarforgotpwd · on March 24, 2016

Fuck. Yes. IBM has a similar API as well as part of their Watson APIs but I really wanted to use Google's.

sandra_saltlake · on March 24, 2016

Sounds like this is bad news for Nuance,

E4life · on March 24, 2016

Finally, this is something that will be the main way for communication in the future.

jupp0r · on March 23, 2016

Anybody got the api docs yet? I wonder if I can stream from chrome via webrtc.

zmodem · on March 23, 2016

Can't you just use the Web Speech API for that? https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.htm...

jupp0r · on March 24, 2016

I want to use cloud speech api for various reasons

braindead_in · on March 24, 2016

How well does this work with conversational speech? Any benchmarks?

BinaryIdiot · on March 23, 2016

So this was very, very exciting until I realized you have to be using Google Cloud Platform to sign up for the preview. Unfortunately all of my stuff is in AWS and I could move it over but I'm not going (far too much hassle to preview an API I may not end up using, ultimately).

Regardless this is still very exiting. I haven't found anything that's as good as Google's voice recognition. I only hope this ends up being cheap and accessible outside of their platform.

oaktowner · on March 23, 2016

I don't think there's any requirement that you use Google Compute Engine in order to use this API. Yes, you sign up for an account, but of course you have to sign up for an account to use it. This API is part of the platform.

Similarly, you can use the Google Translate API without using Compute Engine, App Engine, etc.

Note: I work for Google (but not on any of these products).

BinaryIdiot · on March 24, 2016

Ah that makes sense. It gave me the impression I had to run my services on there. I don't know if I'm the only person to screw that up or not. Thanks for clearing that up!

Matt3o12_ · on March 23, 2016

I think you just need to register for the Google cloud and can then use the service as you want it. There is probably an advantage if you upload it directly from a Google VM (no costs for uploading from a Google service/very low latency) but it would surprise me if you have to have a Google VM. I can after all upload files to S3 and googles cloud storage even if I host the application on my hardware.