This is HUGE in my opinion. Prior to this, to get near state-of-the-art speech recognition into your system or application you either had to have (or hire) the expertise to build your own, or pay Nuance a significant amount of money to use theirs. Nuance has always been a "big bad" company in my mind. If I recall correctly, they've sued many of their smaller competitors out of existence and only do expensive enterprise deals. I'm glad their near-monopoly is coming to an end.
I think Google's API will usher in a lot of new innovative applications.
Other "state-of-the-art" speech recognition solutions already exist.
For example, Microsoft has been offering it through its Project Oxford service.
https://www.projectoxford.ai/speech
I first learned about CMUSphinx from the [Jasper Project](https://jasperproject.github.io/). While Jasper provided an image for the Pi, I decided to go ahead and make a scripted install of CMUSphinx. I spent something like 2 frustrating days attempting to get it installed by hand in a repeatable fashion before giving up.
This was 2 years ago, so maybe it's simple now, but I didn't find it "amazingly easy" back then.
I do have a number of projects where I could definitely use a local speech recognition library. I have used [Python SpeechRecognition](https://github.com/Uberi/speech_recognition/blob/master/exam...) to essentially record and transcribe audio from a scanner. I wanted to take it further, but Google at the time limited the number of requests per day. Today's announcement seems to indicate they will be expanding their free usage, but a local setup would be much better. I'd like to deploy this in a place that might not have reliable Internet.
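For anyone curious, here's a minimal sketch of that kind of setup with the SpeechRecognition library (the file name is just a placeholder); it also shows the local pocketsphinx fallback people mention elsewhere in this thread:

```python
# Rough sketch: transcribe a recorded clip with the SpeechRecognition
# library (pip install SpeechRecognition). "scanner_clip.wav" is a
# placeholder file name used purely for illustration.
import speech_recognition as sr

r = sr.Recognizer()

with sr.AudioFile("scanner_clip.wav") as source:
    audio = r.record(source)  # load the whole clip into memory

try:
    # Cloud-backed recognition (subject to Google's request limits).
    print(r.recognize_google(audio))
except sr.RequestError:
    # Local fallback (requires pocketsphinx to be installed).
    print(r.recognize_sphinx(audio))
```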
In my experiences, the issues with building CMU Sphinx are mainly unspecified dependencies, undocumented version requirements, and forgetting to sacrifice the goat when the MSVC redistributable installer pops up.
We've written detailed, up-to-date instructions [1] for installing CMU Sphinx, and now also provide prebuilt binaries [2]!
If you're interested in not sending your audio to Google, CMU Sphinx and other libraries (like Kaldi and Julius) are definitely worth a second look.
Yeah I'm gonna leave a reply here just in case I need to find this again (already opened tabs, but you never know). This might be big for a stalled project at work. If this can un-stall that, I'll sure owe you a beer ;)
Would you mind submitting this documentation to CMU? I get the feeling they'd love to at least host a link to it to supplement their own documentation.
That sounds like my experience with it from about 5 years ago. I gave up on it too. It also didn't help that CMUSphinx has had more than one version in development, in different languages.
Unfortunately, the situation hasn't improved much. Besides, even if you do get it set up, the recognition quality isn't even close to Google's.
As someone who's worked with a lot of these engines, Nuance and IBM are the only really high-quality players in the space. CMUSphinx and Julius are fine for low-volume operations where you don't need really high accuracy, but if you want high accuracy neither comes close, in my experience.
As someone who has actually done objective tests, Google are by far the best, Nuance are a clear second. IBM Watson is awful though. Actually the worst I've tested.
Do you have a report of your tests? I'm interested in using speech recognition, but there are so many start-ups and big players that it would be quite time-consuming to do a quality/price analysis myself.
For the "dialect" of spanish that we speak in Argentina, Watson misses every single word. So, to me, CMUSphinx is valuable in that it allows me to tweak it, while IBM miserably fails at every word. Must've been trained with Spain or Mexican "neutral" spanish.
Google's engine also works fine (I've been trying it on phones), but the pricing may or may not be a deal-breaker.
Is Julius really state-of-the-art? It looks like they use n-grams and HMMs. Those were the methods that achieved SotA 5+ years ago. My understanding is that Google and Microsoft are using end-to-end (or nearly end-to-end) neural network models; these overtook the older methods a few years ago. I'm not sure how CMUSphinx works under the hood.
They might not be considered state-of-the-art (if you consider both approaches in the same category), but they are definitely one valid approach to voice recognition, which works surprisingly well.
CMUSphinx is not a neural-network-based system; it uses traditional acoustic and language modeling.
CMUSphinx is really easy to set up, and then being able to train it for one's specific domain probably beats state of the art with one-size-fits-all training.
> It is amazingly easy to create speech recognition without going out to any API these days.
Not really. The hard part is not the algorithm, it is the millions of samples of training data behind Google's system. They pretty much have every accent and way of speaking covered, which is what allows them to deliver such a high-accuracy, speaker-independent system.
CMUSphinx is remarkable as an academic milestone, but in all honesty it's basically unusable from a product standpoint. If your speech recognition is only 95% accurate, you're going to have a lot of very unhappy users. Average Joes are used to things like microwave ovens, which work 99.99% of the time, and expect new technology to "just work".
CMUSphinx is also an old algorithm; AFAIK Google is neural-network based.
I think we will have an easy to install OSS speech recognition library and accurate pretrained networks not far off from Google/Alexa/Baidu, running locally rather than in the cloud, within 1-2 years. Can't wait.
From the Microsoft Project Oxford Speech API link:
Speech Intent Recognition
... the server returns structured information about the incoming speech so that apps can easily parse the intent of the speaker, and subsequently drive further action. Models trained by the Project Oxford LUIS service are used to generate the intent.
Microsoft LUIS is almost identical to the intent classification and entity extraction in the Alexa Skills Kit, but it's easier to use because you can pipe in your own text from any source instead of having to use a specific speech recognition engine. LUIS also has a pretty nice web interface that prompts you to label utterances that it's seen that it had trouble with.
Exactly. This is what people are missing when they try to compare Google's speech recognition to other services. Google uses deep neural networks to continuously train and improve the quality of its speech recognition, and it gets its training data from the hundreds of millions of Android users around the world using speech-to-text every day. No other company has a comparable amount of training data, continuously being expanded.
http://googleresearch.blogspot.ca/2015/09/google-voice-searc...
It's probably been on their roadmap for a while, before Alexa came out. Re: Alexa/Echo, I think there is an opportunity for someone to manufacture cheap USB array mics for far-field capture.
Still, having this paid and cloud-based puts a limit on the types of things you'd use it for. I will use it in my own apps for now but will swap to an OSS speech recognition library running locally as soon as one emerges that is good enough.
You're right - this could lead to a lot of new innovation as a bunch of developers who wouldn't have bothered before can now start hacking away to see what they can do.
I've been thinking a lot lately about where the next major areas of technology-driven disruption might be in terms of employment impact, and things like this make me wonder how long it will be before call centers stacked wall to wall with customer service reps become a relic of the past...
If it's anything like Google's other APIs, people will build applications on top of it, and then Google will decide to shut down the API with no notice.
That's incorrect. This is a Google Cloud Platform service, and when it reaches General Availability (GA) it will be subject to our Deprecation Policy, just like Compute Engine, Cloud Storage, etc. That policy requires us to give at least a 1-year heads-up.
It's nice there's a policy around that, but I can understand the fears of someone considering using this to start a product - or even worse, a business.
Google has a history of shutting down useful products; why should people trust this one for long-term integration?
Because we don't have a history of violating promises like this when they're made in writing? Seriously, I'd love to call us just "Cloud Platform" so you don't have to think "oh yeah, those guys cancelled Reader on me", but if you look at the Cloud products we don't play games with this (partly because we hold ourselves to our binding Deprecation Policy, but mostly because we really care).
Google Search API, Autocomplete, Finance, Voice were all closed with tons of active users. I'm not blaming Google; they were acting in their best interest, but the consequence is less enthusiasm for building software that depends on their APIs.
IMO a better option for Google, when considering closing an API, is to enforce payment and hike the price enough to justify maintaining it. If, and only if, enough users drop out at the higher price, then shut it down for good.
Even outside of services with a formal deprecation policy, Google rarely shuts anything down with no notice (their frequently cited shutdowns had long notice.)
No, and they don't need to. They have too many other advantages: low customer acquisition cost via already-present cloud customers, economies of scale, ease of hiring the best talent, natural integrations via their Android platform... who needs nastiness when you have all these amazing benefits!
Isn't Google's strategy typically to just start offering the same services as their smaller competitors but for free and then let them starve? ... Kind of like what's probably happening here? Sounds like this is terrible news for Nuance, for example.
> To attract developers, the app will be free at launch with pricing to be introduced at a later date.
Doesn't this mean you could spend time developing and building on the platform without knowing if your application is economically feasible? Seems like a huge risk to take for anything other than a hobby project.
Only if you spend a huge amount of time integrating with their API. It won't be more expensive than Nuance, so prepare with their pricing in mind and you'll be fine ($.008 / transaction over 20k monthly transactions).
While the platform has moved way ahead of where it was in 2011, the memory of this is still in the back of people's minds, and it would be a good move on Google's part to at least try to engender more trust by giving directional pricing.
It is funny reading this in hindsight. We went straight to AWS, also for long running tcp connections, and never looked at it again. After that I have never recommended app engine.
Sure, would expect nothing less, but I'm also sure they have some sense of the range they are looking for. Such a range could help developers to establish whether their use case is viable or not.
As is, the ambiguity is a deterrent from too much time investment.
They did something very similar to this for their translate API. Realtime IRC client plugins that translated lines in foreign languages into yours in near-realtime went from "awesome" to "just kidding, nevermind" virtually overnight.
If (for any reasonably probable value of final pricing) the cost of voice recognition is a material factor in determining whether the application is economically feasible, you have larger problems. In other words: if your application's economics are going to swing materially depending on how this is priced, it's probably not going to work out even if they keep it free.
Apps can involve users speaking for ten seconds making selections, or an hour getting transcribed. Right now, we don't know the order of magnitude; adding voice interaction to an app could be $.01/user/year or it could be $100/user/year.
It depends. Imagine a hardware device that uses the speech API, and you're looking at the 3-year cost of an average of 5 speech interactions per day (something like the Amazon Echo). Using Nuance pricing of $0.008/request, you're adding 365 * 3 * 5 * $.008 = $43.80 in cost above the BOM cost; it's entirely possible that adding $20 to the BOM is viable, but adding $40 is not.
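Spelled out as a quick back-of-the-envelope script (Nuance-style pricing assumed, since Google hasn't published theirs yet):

```python
# Back-of-the-envelope lifetime API cost per device, assuming
# Nuance-style pricing of $0.008 per request.
price_per_request = 0.008
requests_per_day = 5
years = 3

lifetime_cost = price_per_request * requests_per_day * 365 * years
print("$%.2f per device" % lifetime_cost)  # -> $43.80 per device
```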
That may just be poor phrasing by TechCrunch. It's entirely possible Google will announce the pricing on launch day, but not charge for some period of time so people have the ability to develop the initial apps without paying.
Work the pricing in now (assuming it's the same as the competition's) and architect your app so it's easy to replace the provider down the line if you need to, without breaking the clients.
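Something like the sketch below is usually enough; the class and method names here are entirely made up for illustration:

```python
# Hypothetical provider-agnostic wrapper so the speech backend can be
# swapped later without touching the rest of the app.
from abc import ABC, abstractmethod

class SpeechProvider(ABC):
    @abstractmethod
    def transcribe(self, audio_bytes: bytes, language: str = "en-US") -> str:
        """Return the best transcription for a chunk of audio."""

class GoogleSpeechProvider(SpeechProvider):
    def transcribe(self, audio_bytes, language="en-US"):
        raise NotImplementedError("call the Google Cloud Speech API here")

class NuanceSpeechProvider(SpeechProvider):
    def transcribe(self, audio_bytes, language="en-US"):
        raise NotImplementedError("call Nuance here")

def make_provider(name: str) -> SpeechProvider:
    # A single config value decides the backend; clients never notice a swap.
    return {"google": GoogleSpeechProvider, "nuance": NuanceSpeechProvider}[name]()
```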
I came across the CMU Sphinx speech recognition library (http://cmusphinx.sourceforge.net), which has a BSD-style license, and they just released a big update last month. It supports embedded and remote speech recognition. It could be a nice alternative for someone who may not need all of the bells and whistles and prefers to have more control rather than relying on an API which may not be free for long.
Side note: if anyone is interested in helping with an embedded voice recognition project please ping me.
The last time I did interesting stuff with these libs, I used CMUSphinx -- The documentation for Julius wasn't quite good enough yet, and CMUSphinx has great documentation.
Sphinx is supported by Carnegie Mellon and Julius by Kyoto University/Nagoya Institute of Technology.
I think the easier choice even today might still be Sphinx. Given the excellent documentation (they touch pretty much all the basics you need to know), and the availability of pocketsphinx (C) and Sphinx4 (Java).
It's been years since I used CMU Sphinx, but don't you have to bring your own training data? Sure, there are free data sets out there, and pre-trained models, but they are not as good as what Google et al. have.
Yes. And it's not just the data sets, it's the fundamental technology. You really need state-of-the-art, i.e. LSTM/CTC, to deal with noisy input data and to get to 99% accuracy (in addition to excellent data sets, of course).
As of 9 months ago, for Portuguese, the amount of effort required to build a language model was too great.
Voice recognition was supposed to have been a cherry on top, but it ended up taking out one of our senior developers for the duration of the month-long project, and we were ultimately unable to get it working in the time that we had available.
Tangentially related: does anyone remember the name of the startup/service that was on HN (I believe) that lets you infer actions from plain text?
Eg: "Switch on the lights" becomes
{"action": "switch_on",
"thing" : "lights"
}
etc.. I'm trying really hard to remember the name but it escapes me.
Speech recognition and <above service> will go very well together.
If you're having trouble (like me) to find your "Google Cloud Platform user account ID" to sign up for Limited Preview access, it's just the email address for your Google Cloud account. Took me only 40 minutes to figure that one out.
I wrote a client library for this in C# by reverse engineering what chrome did at the time (totally not legit/unsupported by google, possibly against their TOS). I have never used it for anything serious, and am glad now there is an endorsed way to do this.
> The Google Cloud Speech API, which will cover over 80 languages and will work with any application in real-time streaming or batch mode, will offer full set of APIs for applications to “see, hear and translate,” Google says.
Pretty impressive from the limited look the website (https://cloud.google.com/speech/) gives: the fact that Google will clean the audio of background noise for you and supports streamed input is particularly interesting.
I don't know how I should feel about Google taking even more data from me (and other users). How would integrating this service work legally? Would you need to alert users that Google will keep their recordings on file (probably indefinitely and without their being able to delete them)?
Unless I have gone crazy, Google has had an STT API available to tinker with for a while. It is one of the options for Jasper [1]. Hopefully this means it will be easier to set up now.
Would be nice if they just open sourced it though but I imagine that is at crossed purposes with their business.
SoundHound released Houndify[1], their voice API last year which goes deeper than just speech recognition to include Speech-to-Meaning, Context and Follow-up, and Complex and Compound Queries. It will be cool to see what people will do with speech interfaces in the near future.
Speech recognition can occur either locally or on Google's servers. Since about 2012 [1], Android has been able to do some types of speech recognition, like dictation, on local devices. Additionally, Google Research has recently expanded on this functionality and it seems like much more of the speech recognition will be done locally [2].
Since Android is open-source, would that mean that the voice recognition software (and/or trained coefficients) could, in principle, be ported to Linux?
In the latest versions, there is no more Open Source anything.
Calendar, Contacts, the home screen, the Phone app, and Search are all closed source now.
(btw, all of them, including the Google app, used to be open in Gingerbread)
You can't do TLS without going through Google apps (or packaging spongycastle), you can't do OpenGL ES 3.2, you can't use location anymore, nor use WiFi for your own location implementation.
Since Marshmallow, you are also forced to use Google Cloud Messaging, or the device will just prevent your app from receiving notifications.
To "save battery power" and "improve usability", Google monopolized all of Android.
To whomever downvoted the above post: Please clarify why you think it isn’t relevant to the discussion, or provide counterarguments. All the points I made can be easily sourced (if you wish, I can even post them in here), and are all verifiable.
It wasn’t meant aggressively, just as a question. It’s quite possible that the author of the comment I answered to had not used Android for a few years, or had never cared – or had just missed the announcements of the official apps not being supported anymore.
Oh, wait, there were no announcements, they were dropped silently.
It used to be. I remember "back in the day", Windows 95/98 had it. It wasn't perfect, but it was decent with a bit of training, such that you could dictate reasonably well.
Houndify launched last year and provides both speech recognition and natural language understanding. They have a free plan that never expires and transparent pricing. It can handle very complex queries that Google can't.
FWIW I'd just finished a large blog post researching ways to automate podcast transcription and subsequent NLP.
It includes lots of links to relevant research, tools, and services. Also includes discussion of the pros and cons of various services (Google/MS/Nuance/IBM/Vocapia etc.) and the value of vocabulary uploads and speaker profiles.
As a hard of hearing aspiring software developer, this would be a godsend for me if someone came up with a reliable automated transcription service. I'm often dismayed by the amount of valuable information locked in podcasts and non-transcribed videos and have to rely on goodwill of volunteers to give me transcripts.
PyCon made an admirable effort to live-caption their talks last year, but some of those transcripts never got uploaded along with the talks, which is puzzling; I suppose it could be due to a lack of timecodes.
I've subbed to your blog and hopefully I can contribute whatever I can to make this work out.
FWIW, Google followed the same strategy with Cloud Vision (iirc)..they released it in closed beta for a couple of months [0], then made it generally available with a pricing structure [1].
I've never used Nuance, but I've played around with IBM Watson [2], which gives you 1000 free minutes a month and then charges 2 cents a minute afterwards. Watson allows you to upload audio in 100MB chunks (or is it 10-minute chunks? I forget), whereas Google currently allows 2 minutes per request (edit: according to their signup page [5])... but both Watson and Google allow streaming, so that's probably a non-issue for most developers.
From my non-scientific observation... Watson does pretty well, such that I would consider using it for quick, first-pass transcription... it even gets a surprising number of proper nouns correct, including "ProPublica" and "Ken Auletta" -- though it fudges things in other cases... its vocab does not include "Theranos", which is variously transcribed as "their in house" and "their nose" [3]
It transcribed the "Trump Steaks" commercial nearly perfectly... even getting the homophones in "when it comes to great steaks I just raise the stakes the sharper image is one of my favorite stores with fantastic products of all kinds that's why I'm thrilled they agree with me trump steaks are the world's greatest steaks and I mean that in every sense of the word and the sharper image is the only store where you can buy them"... though later on, it messed up "steak/stake" [4]
It didn't do as great a job on this Trump "Live Free or Die" commercial, possibly because of the booming theme music...I actually did a spot check with Google's API on this and while Watson didn't get "New Hampshire" at the beginning, Google did [4]. Judging by how well YouTube manages to caption videos of all sorts, I would say that Google probably has a strong lead in overall accuracy when it comes to audio in the wild, just based on the data it processes.
edit: fixed the Trump steaks transcription...Watson transcribed the first sentence correctly, but not the other "steaks"
I would say that Google's main goal here is in expanding their training data set, as opposed to creating a new revenue stream. If it hurts competitors (e.g. Nuance) that might only be a side-effect of that main objective, and likely they will not aim to hurt the competition intentionally.
As others here have pointed out, the value now for GOOG is in building the best training data-set in the business, as opposed to just racing to find the best algorithm.
Question from a machine learning noob: how would they use an unlabeled dataset for training? Would Google employees listen to it all (if the ToS even allows that) and transcribe it, then use the result for training? Or is there another way to make it useful without it being labeled?
I doubt that using their customers' audio as training data was a major motivation for offering this service.
But, assuming that was their plan, they'd have a couple options:
- Like you said, they could turn it into supervised training examples by transcribing it. I'm sure they'd at least like to transcribe some of it so that they can measure their performance. Also, while Google does have a lot of 1st party applications feeding them training data, customer data might help them fill in some gaps.
- They might also be able to get some value out of it without transcribing it. Neural networks can sometimes be pre-trained in an unsupervised manner. One example would be pre-training the network as an autoencoder, which just means training it to reproduce its input as its output. This can reduce convergence time.
Couldn't they go the same route they took with classifying objects in images, by using their users to label them? I know reCAPTCHA provides an option to verify yourself by transcribing audio.
While aware of unsupervised learning, I thought I had read somewhere that it wasn't yet a solved problem to use it for this kind of learning. But that might be wrong, and the Wikipedia article mentions some interesting examples.
Quite a few methods of running recognition work in two stages (as a very large simplification).
First, you take the huge input (because something like sound has a huge amount of data in it, similarly with images there are a lot of pixels) and learn a simpler representation of it.
The second problem of mapping these nice dense features to actual things can be solved in different ways, even simple classifiers can perform well.
This doesn't actually need any labelled data. I just want to learn a smaller representation. For example, if we managed to learn a mapping from bits of audio to the phonetic alphabet then our speech recognition problem becomes one of just learning the mapping from the phonetic alphabet to words which is a far nicer problem to have.
Some ways of "deep learning" solve this first problem (of learning neater representations) through a step by step process of what I like to refer to as laziness.
Instead of trying to learn a really, really high-level representation of your input data, just learn a slightly smaller one. That's one layer. Then once we've got that, we try to learn a smaller/denser representation on top of that. Then again, and again, and again.
How can you learn a smaller representation? Well a good way is to try and get a single layer to be able to regenerate its input. "Push" the input up, get the activations in the next layer, run the whole thing backwards and see how different your input is. You can then use this information to tweak the weights to make it slightly better the next time. Do this for millions and millions of inputs and it gets pretty good. This technique has been known about for a long time, but one of the triggers for the current big explosion of use was Hinton working out that this back and forth only really needs to be done once rather than 100 times (which was thought to be required beforehand).
Hinton says it made things 100,000 times faster because it was 1% of the computation required and it took him 17 years to realise it in which time computers got 1000 times faster. Along with this, GPUs got really really fast and easier to program. I took the original Hinton work that took weeks to train and had it running in hours back in 2008 on a cheap GPU. So before ~2006 this technique would have taken years of computer time, now it's down to minutes. Of course, that's then resulted in people building significantly larger networks that take much longer to train but would have been infeasible to run before.
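To make the "push it up, run it backwards, compare" step concrete, here's a minimal numpy sketch of one reconstruction layer (a plain tied-weight autoencoder with random noise standing in for real audio features, not Hinton's exact procedure):

```python
# Minimal single-layer reconstruction ("autoencoder") sketch.
# Random noise stands in for real, unlabelled audio features.
import numpy as np

rng = np.random.default_rng(0)
n_input, n_hidden = 256, 64            # e.g. a window of audio features -> denser code
W = rng.normal(scale=0.01, size=(n_input, n_hidden))
lr = 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(1000):
    x = rng.normal(size=n_input)       # one unlabelled training example
    h = sigmoid(x @ W)                 # "push" the input up to the hidden layer
    x_hat = h @ W.T                    # run it backwards (tied weights)
    err = x_hat - x                    # how different is the reconstruction?
    # Gradient of the squared reconstruction error w.r.t. W
    dW = np.outer(err, h) + np.outer(x, (err @ W) * h * (1 - h))
    W -= lr * dW                       # nudge the weights; repeat for millions of samples
```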
But you still need a lot of unlabelled data. While I doubt Google is doing that with this setup, they have done something like it before: they set up a question-answering service in the US that people could call (I think for free) to collect voice data.
TL;DR
You need labelled data. But it turns out you can learn most of what you need with unlabelled data, leaving you with a much simpler problem to solve. That's great because labelled data is massively more expensive than unlabelled data.
All of your audio is recorded (google.com/dashboard) then the audio files are thrown into a lottery. Individual people transcribe what they hear for the computer to build associations.
Or, at least, that's my best guess with zero research and little knowledge.
Has anyone tried adding OpenEars to their app, to prevent having to send things over the internet from e.g. a basement? Is it any good at recognizing basic speech?
In the sign-up form they state that "Note that each audio request is limited to 2 minutes in length." Does anyone know what "audio request" is? Does it mean that it's limited to 2 minutes when doing real-time recognition, or just that longer periods will count as more "audio requests" and result in a higher bill?
Do they provide a way to send audio via WebRTC or WebSocket from a browser?
Like most machine-learning applications, the source code isn't the interesting part, the data is. Google started by training on millions of phrases from Google 411, and then they've been able to continue training anytime someone issues a voice command to an Android device. They have orders of magnitude more data than you could fit into a GitHub repository.
There aren't many old movies (in the public domain) that are subtitled and/or have good audio... However, in the U.S., we have a huge amount of government audio/video that is captioned. Here's an example of using youtube-dl to download one of President Obama's speeches and extract the closed captions (which I believe are produced by the government) into its own VTT file:
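Roughly, something like this (the video URL is a placeholder, and the exact youtube-dl options may have changed since):

```python
# Sketch: grab only the closed captions from a video as a .vtt file.
# The URL is a placeholder; requires `pip install youtube_dl`.
import youtube_dl

opts = {
    "skip_download": True,        # we only want the captions, not the video
    "writesubtitles": True,       # the uploaded/official captions
    "writeautomaticsub": False,   # skip YouTube's auto-generated ones (see the point below)
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
}

with youtube_dl.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=EXAMPLE_VIDEO_ID"])
```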
It doesn't even need to be old movies. Certain types of video content in the US are legally required to have subtitles (e.g. a lot of YouTube content). You could programmatically download them and use that as your training set. And, since it is a transformative work, you can train your models even on copyrighted works freely.
Much of the YouTube content has auto-generated subtitles, i.e. Google is running their speech-recognition software on the audio stream and then using that to caption the video. If you used that as your training set, you're effectively training on the output of an AI. Which is kind of a clever way to get information from Google to your open-source library, but will necessarily be lower-fidelity than just using the Google API directly.
In the US, if it's ever been played out on broadcast TV then it must have Closed Captions.
This is enforced by the FCC [0], but as more and more "internet" content gets consumed I imagine the same regulations will eventually come, at which point you've got a fantastic training set.
That's actually a pretty good idea. Lyrics for rap music might also be a good training data set. It'd bias strongly toward English, though, and particularly American English. I suspect the size of the resulting data set is also quite a bit smaller than what Google has.
Especially because Google's version is trained with illegally obtained user data (no, changing your ToS doesn't allow you to use previously collected data for new purposes in the EU).
We, as a society, should discuss whether software trained on user data should be required to be available to those who provided that data, and whether any copyright can even exist for software developed by training neural networks, or whether it's by definition public domain.
Software trained with neural networks provides an immense advantage for existing monopolies, and makes it extremely hard for competitors.
If this trend continues, startups will become impossible for software that depends on being trained with huge datasets.
I thought I read open source, then I realized open access. I believe in the past there was a similar API, or maybe it was based on Google Translate. But I swear at one point people wrote hackathon projects using some voice APIs.
I wanted to see if there was good speech recognition out there, and all I could Google up was Nuance... Now that it's going to be "disrupted", I've finally found out about it... ironic.
I would be hesitant to build an entire application that relied on this API only to have it removed in a few months or years when Google realizes it sucks up time and resources and makes them no money.
It seems there are a few options. Devs could use an abstraction layer and change providers with very little effort. Also, by the way things seem to be going, voice will become more and more prevalent.
I'm reading about many libraries here; I wonder what the best open, multi-platform software is for speech recognition for coding in vim, Atom, etc. I've only seen a hybrid system working with Dragon + Python on Windows. I would like to train/customize my own system since I'm starting to have pain in my tendons and wrists. Do you think this Google API can handle it? Not being local looks like a limiting factor for speed/lag.
Nothing out of the box as far as I know. You'll have to DIY.
Have a look at CMUSphinx/Pocketsphinx [1]. I wrote a comment about training it for command recognition in a previous discussion[2].
It supports BNF-grammar-based training too [3], so I have a vague idea that it may be possible to use your programming language's BNF to make it recognize language tokens. I haven't tried this out though.
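For the simpler case of a fixed command grammar (not a full programming-language BNF), my understanding is that the setup looks roughly like this untested sketch; the model paths/names are assumptions that depend on how you installed pocketsphinx:

```python
# Untested sketch: constrain pocketsphinx to a small JSGF command grammar.
# Model directory/file names are assumptions; adjust to your install.
import os
from pocketsphinx import Decoder, get_model_path

grammar = """#JSGF V1.0;
grammar editor;
public <command> = (open | save | close) (file | buffer);
"""
with open("editor.gram", "w") as f:
    f.write(grammar)

model_path = get_model_path()
config = Decoder.default_config()
config.set_string("-hmm", os.path.join(model_path, "en-us"))
config.set_string("-dict", os.path.join(model_path, "cmudict-en-us.dict"))
config.set_string("-jsgf", "editor.gram")   # restrict recognition to the grammar
decoder = Decoder(config)
# Then feed audio with decoder.start_utt() / decoder.process_raw(...) / decoder.end_utt().
```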
Either way, be prepared to spend some time on the training. That's the hardest part with sphinx.
Also, have you seen this talk for doing it on Unix/Mac [4]? He does use natlink and dragonfly, but perhaps some concepts can be transferred to sphinx too?
What is the difference between a speech recognition API and [NLP libraries](https://opennlp.apache.org/)? This information was not easily found with a few Google searches, so I figured others might have the same question.
I hope this opens up some new app possibilities for the Pebble Time. I believe right now they use Nuance and it's very limited to only responding to texts.
So this was very, very exciting until I realized you have to be using Google Cloud Platform to sign up for the preview. Unfortunately all of my stuff is in AWS; I could move it over, but I'm not going to (far too much hassle to preview an API I may not end up using, ultimately).
Regardless, this is still very exciting. I haven't found anything that's as good as Google's voice recognition. I only hope this ends up being cheap and accessible outside of their platform.
I don't think there's any requirement that you use Google Compute Engine in order to use this API. Yes, you sign up for an account, but of course you have to sign up for an account to use it. This API is part of the platform.
Similarly, you can use the Google Translate API without using Compute Engine, App Engine, etc.
Note: I work for Google (but not on any of these products).
Ah that makes sense. It gave me the impression I had to run my services on there. I don't know if I'm the only person to screw that up or not. Thanks for clearing that up!
I think you just need to register for Google Cloud and can then use the service as you want. There is probably an advantage if you upload directly from a Google VM (no costs for uploading from a Google service/very low latency), but it would surprise me if you had to have a Google VM. I can, after all, upload files to S3 and Google's Cloud Storage even if I host the application on my own hardware.