Neural Network-Based Multimodal Dialog Technologies toward Human-Robot Communication
Natural spoken language interaction between humans and robots has been a long-standing dream of artificial intelligence. However, traditional dialog systems rely on hand-crafted rules to support a limited task domain, such as querying information from a database. Recently, new machine learning methods have opened up the possibility of much more flexible dialog scenarios, such as conversation about events unfolding in the world. To this end, we introduce deep learning architectures such as attention-based sequence-to-sequence modeling and multimodal attentional fusion. These models generate unified semantic representations of natural language and audio-visual inputs, which facilitate flexible discourse about a scene. This work represents a key step toward real-world human-robot interaction.
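The multimodal attentional fusion mentioned above can be illustrated with a minimal sketch: the decoder attends separately over each modality's features, then a second attention fuses the per-modality context vectors. This is a simplified, hypothetical illustration with random feature matrices and unlearned (identity) projections, not the actual architecture described in the talk.

```python
import numpy as np

def attention(query, keys, values):
    # Scaled dot-product attention: weight each value vector by the
    # softmax-normalized similarity between the query and its key.
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def multimodal_fusion(query, modalities):
    # Step 1: attend within each modality to get one context vector each.
    contexts = np.stack([attention(query, feats, feats) for feats in modalities])
    # Step 2: attend across modalities to fuse the context vectors into
    # a single unified representation conditioned on the decoder state.
    return attention(query, contexts, contexts)

rng = np.random.default_rng(0)
d = 8
query = rng.standard_normal(d)             # e.g. a decoder hidden state
text_feats = rng.standard_normal((5, d))   # e.g. word embeddings
audio_feats = rng.standard_normal((7, d))  # e.g. audio-frame features
video_feats = rng.standard_normal((6, d))  # e.g. video-frame features

fused = multimodal_fusion(query, [text_feats, audio_feats, video_feats])
print(fused.shape)  # (8,)
```

In practice each modality would first pass through a learned encoder and projection layer, and the attention weights over modalities let the model emphasize, say, audio features when answering a question about sound.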
Dr. Chiori Hori has worked on spoken language processing technologies since 1998. In 2002, at NTT, she worked on spoken interactive Q&A using a real-time Automatic Speech Recognition (ASR) system based on Weighted Finite-State Transducers (WFSTs) with an over-one-million-word vocabulary. She joined CMU in 2004 and then moved to ATR/NICT in 2007. She led the NICT ASR research group, whose system took first place in the English TED talk recognition task at IWSLT for three consecutive years starting in 2012. She invented a WFST-based dialog technology that was implemented on a humanoid robot at NICT. She has been working on neural network-based technologies for human-robot communication at MERL since 2015.