I recently got asked to provide an opinion on “voice recognition”, in particular around our philosophy towards it and how we’ve implemented it across the stack. If you can stomach it, you can see how it turned out (let’s put it this way: it opens with a comparison to the Hoff’s “Knight Rider” and it kind of goes downhill from there). Regardless, in doing the research I learnt some really interesting things along the way that I thought I’d share here.
First off, let’s start by asking how many of you know how speech recognition works these days? Well, I thought I did, but it turns out I didn’t. Unlike the early approach, where you had to “train” the computer to understand you by spending hours and hours reading to it (which always kind of defeated the object for me), today speech recognition works pretty much the same way we teach kids to speak and read: using phonemes, digraphs and trigraphs. The computer simply tries to recognise the shapes and patterns of the words being spoken, then, using some clever logic and obviously an algorithm or two, performs some contextual analysis (makes a guess) at the most probable sentence or command you might be saying.
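To make that “contextual analysis” idea concrete, here’s a toy sketch, entirely my own illustration with made-up numbers, and nothing to do with how any real recogniser is built. Given a couple of acoustically plausible candidates for each chunk of speech, it picks the combination that a simple word-pair model scores as most probable:

```python
from itertools import product

# Hypothetical acoustic candidates: the recogniser "heard" something that
# could be "recognise speech" or "wreck a nice beach".
candidates = [
    [("recognise", 0.6), ("wreck a nice", 0.4)],
    [("speech", 0.5), ("beach", 0.5)],
]

# Hypothetical word-pair (bigram) scores: how likely is word B after word A?
bigram = {
    ("recognise", "speech"): 0.9,
    ("recognise", "beach"): 0.1,
    ("wreck a nice", "beach"): 0.7,
    ("wreck a nice", "speech"): 0.05,
}

def best_sentence(candidates):
    """Score every combination of candidates and return the most probable."""
    best, best_score = None, 0.0
    for combo in product(*candidates):
        words = [w for w, _ in combo]
        score = 1.0
        for _, p in combo:                   # acoustic evidence
            score *= p
        for a, b in zip(words, words[1:]):   # contextual evidence
            score *= bigram.get((a, b), 0.01)
        if score > best_score:
            best, best_score = " ".join(words), score
    return best

print(best_sentence(candidates))
```

Even though the two second-position words are an acoustic coin-flip here, the context model tips the guess, which is the whole trick: the sounds alone aren’t enough, the probabilities do the rest.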
In the early days of speech recognition, the heavy lifting was all about the listening and the conversion from analogue to digital; today it’s in the algorithmic analysis of what you are most likely saying. This subtle shift has opened up probably the most significant advance in voice recognition in the last twenty years: the concept of voice recognition as a “cloud” service.
A year or so ago I opened a CIO event for Steve Ballmer. Given I was on stage first, I got a front row seat at the event and watched Ballmer up close and personal as he proceeded to tell me, and the amassed CIOs from our 200 largest customers, that the Kinect was in fact a “cloud device”. At the time I remember thinking, “bloody hell Steve, even for you that’s a bit of a stretch, isn’t it?”. I filed it away under “Things CEOs say when there’s no real news” and forgot about it, until now, that is, when I finally realised what he meant.
Basically, with a connected device (like Kinect), the analysis of your movements and the processing for voice recognition can now also be done in the cloud. We now have the option (with the consumer’s appropriate permission) to use those events to provide a service that continuously learns and improves. This ultimately means that the voice recognition service you use today is actually different from (and minutely inferior to) the same service you’ll use tomorrow. This is incredibly powerful, and it also shows you that the “final mile” of getting voice recognition right now lies more with the algorithm that figures out what you’re most likely to be saying than it does with the actual recognition of the sounds. MSR has a number of projects underway around this (my current favourite being MSR’s Sentence Completion Challenge), not to mention our own development around how this might apply within search.
Those of you that have been following these ramblings in the past will know I’m slightly sceptical of voice recognition, thinking of it as technology’s consistent wayward child: full of potential, yet unruly, unpredictable and relentlessly under-achieving. I’m not saying my view has changed overnight, but I am certainly more inclined to think it will happen, based on this single, crucial point.
Kinect too provides its own clue that we’re a lot closer than we previously thought to making voice recognition a reality, not just in the fact that it uses voice recognition as a primary mode of (natural) interaction, but more in how it tries to deal with the other end of the voice recognition problem: just how do you hear _anything_ when you are sat on top of the loudest source of noise in the room (the TV) and someone 10 feet away is trying to talk to you in the middle of a movie (or the final level of Sonic Generations, sat next to a screaming 6 year old whose entire opinion of your success as a father rests on your ability to defeat the final “boss”)? If you have a few minutes and are interested, this is a wonderful article that talks specifically about that challenge and how we use an array of 4 microphones to try and solve the problem. There’s still more work to be done here, but it’s a great start on what is actually an incredibly complex problem. Think about it: if I can’t even hear my wife in the middle of a game of Halo or an episode of Star Trek (the original series, of course), how the hell is Kinect going to hear? (Oh, I’ve just been informed by her that that particular issue is apparently not a technical problem… #awkward).
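For the curious, here’s a toy sketch of the core idea behind that kind of microphone array, often called delay-and-sum beamforming. Every number below (the signal, the delays, the noise level) is made up for illustration, and the real Kinect pipeline is far more sophisticated, but the principle is this: the voice arrives at each microphone at a slightly different time, so if you shift each channel to undo its delay and then average them, the voice reinforces itself while the uncorrelated noise partially cancels:

```python
import random

random.seed(1)

# A hypothetical voice waveform, and the per-microphone sample delays implied
# by the speaker's direction (four mics, as in the article's array).
voice = [1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0]
delays = [0, 1, 2, 3]

def mic_capture(voice, delay, noise_level=1.0):
    """What one microphone hears: a delayed copy of the voice plus
    independent random noise (standing in for the TV)."""
    delayed = [0.0] * delay + voice
    return [s + random.uniform(-noise_level, noise_level) for s in delayed]

def delay_and_sum(channels, delays):
    """Align each channel by removing its known delay, then average them.
    The voice samples line up and add; the noise samples don't."""
    aligned = [ch[d:d + len(voice)] for ch, d in zip(channels, delays)]
    return [sum(samples) / len(aligned) for samples in zip(*aligned)]

channels = [mic_capture(voice, d) for d in delays]
output = delay_and_sum(channels, delays)
```

With four microphones, averaging cuts the uncorrelated noise power by roughly a factor of four, which is exactly the kind of head start the recognition algorithms need before they even begin guessing at words.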
So these two subtle technological differences in our approach are going to make all the difference in voice recognition becoming a reality as part of a much more natural way of interacting with technology. Once that happens, we move into the really interesting part of the problem – our expectations of what we can do with it.
Our kids are a great way of understanding just how much of a Pandora’s box getting into voice recognition (and other more natural forms of interaction) will be, and I suspect that ultimately our greatest challenge will be living up to the expectation of what is possible across all the forms of technical interaction we have, NUI parity across devices if you like. My son’s expectation (quite reasonably) is that if he can talk to his Xbox, then he should be able to talk to any other device, and furthermore, if he can ask it to play movies and navigate to games, why can’t it do other things? I was sitting doing my research with him the night before my interview on all of this, and we were playing together at getting the voice recognition to work. He asked the Xbox to play his movie, he told Sonic which level to play on Kinect FreeRiders, then he paused, looked at me and then back at the TV, cracked a cheeky smile and said, “Xbox, do my homework…”.