With the success of Amazon’s Echo and its voice-controlled assistant Alexa, the battle between smart speakers to become the hub of home automation is heating up.
Traditionally, these devices had to be operated with buttons, a remote, or other physical controls, limiting their capabilities. As AI becomes more mainstream and customers demand more from their devices, the need for them to become more user-friendly grows. Consumers want instantaneous responses without having to hunt down a remote or get up to approach their device - far-field voice activation is just around the corner.
Far-field voice technology powers Amazon Echo, Google Home, and Apple HomePod, amongst others - but how? Far-field speech recognition is far more complicated than we might have initially thought, and Tao Ma, Principal Architect, AI Platform & Research at JD.com, has shared with us some of his work in the area, including the background, system design, and architecture of these systems.
Tao will be speaking at the AI Assistant Summit in San Francisco this January 25 & 26, where he will expand further on his current work and share his most recent progress in the field. As Tao previously worked as a Senior Speech Scientist on Siri at Apple, we had several interesting questions to ask him, but kicked off by finding out a bit more about his current and past work with voice assistants. Currently at JD.com, China’s largest online retailer, he is driving the effort of ‘commercialization of voice technology in current and future JD.com products.’ This includes integrating these features into an AI shopping assistant, intelligent customer service, voice product search, and smart speakers. Previously at Siri, Tao was working in ‘R&D on core algorithms to improve the performance of Siri speech recognition, improving recognition accuracy, reduce recognition latency, developing scalable and high availability speech platform on the cloud.’
Back in 2003, Tao participated in Microsoft’s “Creative Cup” International .NET Programming Competition where he used Microsoft Speech API and SALT (Speech Application Language Tags) to enable voice interactions.
‘That was the first time I developed voice technology and I was hooked. Later I came to United States to pursue my PhD degree in human language technology, during my PhD I was the maintainer of ISIP recognizer which was one of the popular open-sourced speech recognizers at that time. In May 2012, I joined Siri team at Apple, working to improve the underneath speech recognition technology to make Siri better understanding human speech.’
Throughout his work, Tao has found that there are two major challenges for far-field voice technology: ‘sound reverberation and noises.’ During far-field voice interaction, the distance between the microphone and the speaker can be as much as several meters, with sound-reflective surfaces (floor, walls, etc.) in between. The speech signal is reflected off these surfaces, causing a large number of reflections to build up and then decay as the sound is absorbed by objects in the room. Additive noise is another problem - for example, people talking in the background or a TV playing. ‘Thus, speech signal is usually significantly degraded by reverberation and additive noises.’ This is part of the reason these assistants often return incorrect or unrelated results to our queries.
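The degradation Tao describes can be sketched numerically. In a toy illustration (my own, not from the article), reverberation is commonly modeled as convolving the clean signal with a room impulse response, and background sounds as additive noise; the sample rate, decay constant, and noise level below are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000                            # sample rate in Hz (assumed)
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)   # a pure tone standing in for speech

# Synthetic room impulse response: a direct path followed by
# exponentially decaying random reflections off walls and floor.
rir = rng.standard_normal(2000) * np.exp(-np.linspace(0, 8, 2000))
rir[0] = 1.0                          # direct path from speaker to mic

# Reverberation = convolution with the impulse response,
# truncated back to the original signal length.
reverberant = np.convolve(clean, rir)[: len(clean)]

# Additive noise, e.g. background chatter or a TV playing.
noise = 0.05 * rng.standard_normal(len(clean))
far_field = reverberant + noise       # what a far-field mic actually records
```

The recognizer never sees `clean`; it sees `far_field`, which is why front-end techniques such as dereverberation and noise suppression matter so much for these devices.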
A voice assistant involves multiple AI components: ASR, TTS, NLP, and dialogue. Developing each component on its own presents many challenges - far-field conditions for ASR, for example. Tao explained that he found the most challenging part to be ‘how to coherently integrate all components together to make the voice assistant reach the best possible performance. Since each component is developed and evaluated using different metrics, the joint optimizations among these components are very important and critical for the ultimate system performance: the task completion rate.’
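One way to see why this joint optimization matters (toy numbers of my own, not from the article): even when each stage scores well on its individual metric, independent errors compound through the pipeline, so the end-to-end task completion rate can be far lower than any single stage suggests.

```python
# Hypothetical per-stage accuracies, each measured on its own metric.
asr_accuracy = 0.95       # e.g. word-level recognition accuracy
nlu_accuracy = 0.93       # e.g. intent classification accuracy
dialogue_accuracy = 0.96  # e.g. correct action selection

# If stage errors were independent, end-to-end success would be roughly
# the product of the per-stage accuracies.
task_completion = asr_accuracy * nlu_accuracy * dialogue_accuracy
print(round(task_completion, 3))  # ~0.848 despite strong per-stage scores
```

In practice the stages interact (an ASR error may or may not flip the intent), which is exactly why optimizing each component in isolation against its own metric can leave task completion on the table.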
The popularity of home voice assistants was somewhat unexpected, and as a result many experts in the area, including Tao, are now very optimistic about the future of smart home devices. ‘Voice is becoming the mainstream interface for all kinds of smart home devices. The popularity of smart home speaker is introducing us to a new era, I am expecting to see prosperity of more smart home devices. Voice controlled smart TV, microwave, thermometer, you name it.’ People are getting used to talking to their devices, and it’s no longer seen as an embarrassing or abnormal thing to do in public.
'In the future, for any home appliance manufacture, lacking of voice interface would become a significant disadvantage in this highly competitive industry.'
Voice assistants are sure to become proficient in many customer service roles, so I asked Tao his opinion on this and whether we would see a significant number of jobs lost as a result of near-human AI in voice assistants:
‘I am seeing AI shifting jobs among different industries, instead of simply ‘stealing’ jobs from human. Most well-defined repetitive labor task could be easily replaced by AI and robots. I am expecting more jobs created to develop and customize AI. To me, Strong AI is still far far way from us. Until then, AI is more like an advanced tool to assistant human instead of replacing human. Many existing jobs would require workers to upgrade skills to leverage AI as the new tool. Previous example is with introduction of personal computer, lacking computer skills would make a worker losing his/her job. And some new jobs could be created for hybrid human+AI solutions. For example, Palantir is solving some very hard problems by combining AI and human intelligence.’
To learn more about voice assistants, join Tao at the AI Assistant Summit in San Francisco next month - January 25 & 26. Other confirmed speakers include Ofer Ronen, Chatbase, Rushin Shah, Facebook, Alok Kothair, Apple, Pararth Shah, Google, Jane Nemcova, Lionbridge, Ann Thyme-Gobbel, Sound United and many more which you can view here.