Client-side mobile speech recognition

Imagine if on your iPhone you had to type a whole paragraph, and then wait a few seconds for it to get sent to Apple’s server, and then get the text back to see if any words were mistyped or miscorrected. 

That is how speech recognition works on mobile devices today. It is all done server side (to try out a state-of-the-art example, download the Dragon Dictation app on your iPhone, or try the built-in speech rec on an Android device). Perhaps Apple’s Siri will improve this (I hope!). But until speech recognition gets very close to 100% accuracy, the best way to improve the user experience will be to show each word and sentence as you speak and let you correct as you go, without waiting for the back-and-forth to web servers.
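To make the "show each word as you speak" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `dictate` function, the sample word list, and the `corrections` mapping are all hypothetical, and no real recognizer or vendor API is involved. The recognizer is assumed to emit one word at a time, and the display is refreshed after every word, so a misrecognized word can be fixed the moment it appears rather than after a server round trip.

```python
from typing import Dict, Iterable, Iterator, List


def dictate(words: Iterable[str], corrections: Dict[int, str]) -> Iterator[str]:
    """Yield the text shown on screen after each newly recognized word.

    `words` stands in for a recognizer that emits one word at a time
    (hypothetical; not a real speech API). `corrections` maps a word
    position to the fix the user makes as soon as that word appears.
    """
    shown: List[str] = []
    for position, word in enumerate(words):
        # Display the word right away, applying the user's fix if there is one.
        shown.append(corrections.get(position, word))
        yield " ".join(shown)


# Hypothetical recognizer output with one misrecognized word ("male").
recognized = ["send", "the", "male", "today"]
screens = list(dictate(recognized, {2: "mail"}))

for screen in screens:
    print(screen)
```

In the server-side flow described above, the user would see nothing until the whole utterance came back, and only then discover that "male" was wrong.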

Open source projects like CMU’s PocketSphinx seem to provide sophisticated client-side mobile speech rec. My understanding is that modern mobile devices don’t have the resources (processing/memory) for client-side speech rec to get anywhere near the accuracy you can get on the server side. At least not yet.


9 thoughts on “Client-side mobile speech recognition”

  1. Tyler Arnold says:

    I am pretty sure the Siri version on the iPhone 4S does a lot more client side DSP work than the existing solutions out there

  2. chris dixon says:

    That’s my hope and would be really cool. Makes sense since it requires the 4S’s faster processor.

  3. bpadams says:

    Are computational resources the limiting factor? My understanding is that Google’s server-side voice-to-speech works so well because they’re able to do some empiric comparison between the captured audio and their enormous real-time database of queries. That’s how, e.g., they’re able to get so many proper nouns correct — they have access to an incomparable amount of context that would be impossible to build into any standalone computer (let alone a phone).

  4. bpadams says:

    * voice-to-text, obviously.

  5. chris dixon says:

    bpadams – I think we agree, except you are saying that the computational resources on mobile devices need to get WAY bigger (Google server farm bigger). I wonder if there is a power law here where they could load the top N nouns and get 80% of the improvement…

  6. kmamyk says:

    Speech recognition is not one big monolithic process. Typically it consists of the “front end”: phonetic recognition of speech utterances, word recognition from utterances, and meaning extraction/semantic processing. Mobile devices nowadays can easily handle phonetic and word recognition (the constraint here is rather the memory needed to hold the acoustic and pronunciation models), and the results can be sent to the server for further processing. This would allow for the on-device correction Chris is talking about.

    The downside is that it is very difficult (if not impossible) to develop an acoustic model that gives adequate results for all speakers under all acoustic conditions (noise level/pattern, etc). Acoustic models are “trained” with a set of speakers, under certain noise conditions, which “averages” out differences b/w speakers.

    Of course you can come up with a process for re-targeting the model to the voice of the device owner, but you will need the original audio for that (or at least the extracted MFCCs).

  7. bpadams says:

    Yeah, I wonder how much empiric data you need in order to get performance that’s delightful for the user. That’s my big worry for Siri: that people tend to have very high expectations of speech recognition systems (cf. IVR systems). It could be marketing-speak, but Google says that they train their language model on 230 billion pieces of data (using 70 CPU-years), and that it’s both on query data and the collected voice samples. So, yikes.

  8. Srini Kumar says:

    Chris, have you tried the 4S? The voice recognition is absolutely incredible. It’s right out of science fiction! I am using it right now on this comment on your post and find that I almost never have to correct anything. The experience of using my own app tinyvox has been expanded and enhanced tremendously because of this work done by Apple. The Dragon app was always deeply flawed and limited in user experience beyond repair. Talking this very comment out to you right now is something close to a spiritual experience. People talk about Siri as if it is some big deal, but this is an absolute breakthrough right here. Siri is a toy compared to this amazing new voice recognition feature. I urge you passionately to try it because it just works, flawlessly.

  9. Chris Hawkins says:
