Interpreting Expression from Voice in Human-Computer Interactions
Millions of people reach out to digital assistants such as Siri every day, asking for information, making phone calls, seeking assistance, and more. The expectation is that these assistants should understand the intent of the user’s query. Detecting the intent of a query from a short, isolated utterance is a difficult task. Intent cannot always be obtained from speech-recognized transcriptions. A transcription-driven approach can interpret what has been said but fails to acknowledge how it has been said, and as a consequence may ignore the expression present in the voice. In this talk, we will explore whether a machine-learned system can reliably detect vocal expression in queries using acoustic and primitive affective embeddings. We will further explore whether it is possible to improve affective state detection from speech using a time-convolutional long short-term memory (TCLSTM) architecture. We will demonstrate that using intonation and affective state information can help to attain a 60% relative decrease in equal error rate (EER) compared to a bag-of-words-based system, corroborating that expression is significantly conveyed by vocal attributes rather than being purely lexical.
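To make the headline metric concrete, the following is a minimal sketch of how an equal error rate can be computed from classifier scores by sweeping a decision threshold until the false-accept and false-reject rates meet. The function and the example scores are illustrative and are not drawn from the system described in the talk.

```python
def eer(pos_scores, neg_scores):
    """Return the equal error rate: the operating point where the
    false-accept rate (FAR) and false-reject rate (FRR) are closest,
    found by sweeping every distinct score as a threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        far = sum(s >= t for s in neg_scores) / len(neg_scores)  # negatives accepted
        frr = sum(s < t for s in pos_scores) / len(pos_scores)   # positives rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Illustrative scores: well-separated classes give an EER of 0.
print(eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))
```

A "60% relative decrease" means the proposed system's EER is 40% of the baseline's, e.g. a baseline EER of 10% dropping to 4%.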
Vikramjit Mitra is a Senior Research Scientist at Apple, working on speech science and machine learning for human-machine interactions. Previously, he was an Advanced Research Scientist at SRI International's Speech Technology and Research (STAR) Laboratory from 2011 to 2017. He received his PhD in Electrical Engineering from the University of Maryland, College Park in 2010. His research interests include speech for health applications, robust signal processing for noise/channel/reverberation, speech recognition, production/perception-motivated signal processing, information retrieval, machine learning, and speech analytics. One of his major research contributions is the estimation of speech articulatory information from the acoustic signal, and the use of such information for recognition of both natural and clinical speech and for the detection of depressive symptoms in adults. He led the STAR lab's efforts on robust acoustic feature research and development, which led to state-of-the-art results in keyword spotting and speech activity detection in DARPA's Robust Automatic Transcription of Speech program. He has served as the PI/co-PI of several projects funded by NSF and has worked on research efforts funded by DARPA, IARPA, AFRL, NSF, and Sandia National Laboratories. He is a senior member of the IEEE and an affiliate member of the Speech and Language Processing Technical Committee (SLTC), and he has served on the scientific committees of several workshops and technical conferences.