Apple’s Siri and advanced in-car technology have focused the spotlight on voice technology, increasing demand for developers who can help both startups and established companies integrate speech in a range of new ways.
Voice technology itself isn’t new – Carnegie Mellon University has taught classes in it for some 30 years, notes Alan Black, associate professor at the school’s Language Technologies Institute – but it has evolved far beyond the call center systems and medical transcription technology of the past.
Today the technology is also more accessible, with new APIs making the development of speech applications in common programming languages easier than ever.
“Siri put the notion of public dialogue systems in the public view. My students have commented that now their mothers know what they do. Before, they could never explain it,” Black laughs.
SiliconValley.com recently referred to a “high-profile voice-technology arms race” between Google and Apple – and reportedly, a dispute over voice turn-by-turn directions led to Apple’s disastrous decision to dump Google Maps. It’s not just the big names who are interested in voice. Ron Kaplan, senior director and distinguished scientist at Nuance Communications, told SiliconValley.com that voice technology used to be like talking to a toddler, but now it’s grown up somewhat. Today it’s more like talking with a five- or six-year-old, with the conversation “starting to get interesting.”
While the technology still makes mistakes, Black says companies have begun to find acceptable levels of accuracy. That, in turn, means people “can start to look at more. So rather than just dictation, which people were doing 10 years ago, people are realizing you don’t have to be that accurate,” he explains. “Now companies like Google and others are trying to make speech more than just command and control like it was before.”
In addition, combining voice with other technologies is leading to new and innovative creations. Nuance, for instance, is working with chip makers to allow smartphones to respond to voice commands even when they’re in “sleep” mode.
William Meisel, editor of Speech Strategy News, is fascinated with Siri’s ability to actually answer your questions, rather than just provide a list of links – something that disrupts Google’s ad strategy. “Where this becomes important commercially is that when you just answer the question, you are bypassing classical search,” he says.
Not only that, Siri can dive into your email, calendar, contact lists and other programs. It not only finds information quickly but also composes emails when you need to, say, reschedule a meeting. Meisel calls such approaches the next step beyond the graphical user interface, which has become overburdened with too many files, especially on phones’ small screens.
At the same time voice technology, as a field, is becoming more interdisciplinary. For instance, the University of Arizona’s Human Language Technology master’s program includes computational linguistics, natural language processing, computer science, artificial intelligence, linguistics, psychology, philosophy, mathematics and statistics.
Highly Sought Skills
While students with master’s and doctorate degrees in speech technology are highly sought after, Black says, undergraduate computer science majors with just one class in the area are also being snapped up for software engineering positions at companies like IBM, Microsoft, Apple, Google and Facebook.
If they know multiple languages (natural languages, that is), so much the better. Says Black: “A lot of companies prototype in English, then want to translate [the software] into an array of languages.”
More specifically, employers are looking not only for skills in computational linguistics and natural language processing, but also for experience with the industry-standard VoiceXML language, as well as C/C++/Objective-C and the Unix/Linux, iOS and Android platforms.
While a lot of platforms use VoiceXML, they’re often not compatible with each other, Black says. CMU teaches VoiceXML, but there’s been a proliferation of speech APIs in various programming languages, so he also sends students out to work with those.
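To make the industry standard concrete: VoiceXML dialogs are ordinary XML documents in which forms, fields and prompts describe a phone conversation, and the platform handles the speech recognition and synthesis. The following is an illustrative sketch of a minimal bus-route lookup dialog of the kind CMU students might build; the field name and the submit URL are placeholders, not from any real deployment.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="route_lookup">
    <!-- "digits" uses the platform's built-in grammar for spoken numbers -->
    <field name="bus_number" type="digits">
      <prompt>Which bus route would you like information about?</prompt>
      <filled>
        <prompt>Looking up route <value expr="bus_number"/>.</prompt>
        <!-- hand the recognized digits to a server-side script (placeholder URL) -->
        <submit next="http://example.com/routes" namelist="bus_number"/>
      </filled>
      <noinput>
        <prompt>Sorry, I didn't hear you.</prompt>
        <reprompt/>
      </noinput>
    </field>
  </form>
</vxml>
```

The appeal of this declarative style is that the developer never touches audio directly; the incompatibilities Black mentions arise because each platform layers its own extensions and grammar formats on top of documents like this one.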
“Now we’re talking about having speech recognition and speech synthesis be part of any programming language like Python or C or .NET or whatever … [APIs] that allow you to better integrate those into your systems, in the same way you have Windows toolkits or Android,” he says.
“While I would not say that any programmer can use these APIs and be successful – there are limitations; you have to understand what they can and cannot do – most programming students can learn to use speech in interesting ways,” Black believes.
Next: Learning to ‘Read’ Emotion
Now taking shape is research into developing systems that can discern the speaker’s emotions. For instance, Black says, are users sure about what they’re asking? When offered an option, can the technology tell whether it’s what users really want, or are they just settling? “Being able to recognize expressiveness, emotion in speech is something new that’s beginning to happen,” he says.
Why does this matter? One CMU student worked on the voice system for finding bus routes that the university provides to the city of Pittsburgh. After testing a number of strategies for dealing with angry callers, including having the system grow louder and, in essence, turn the call into a shouting match, he found that if the system instead grew quieter, callers were more likely to calm down.