Cracking the Cocktail Party Problem: Deep Clustering for Speech Separation
The human auditory system gives us the extraordinary ability to converse above the chatter of a lively cocktail party. Selective listening under such conditions is extremely challenging for computers, and has been a holy grail of speech processing for more than 50 years. Until now, no practical method existed for single-channel mixtures of speech, especially when the speakers are unknown. We present a breakthrough in this area using a new type of neural network we call deep clustering. Our deep clustering network assigns an embedding vector to each sonic element of the noisy signal; clustering these embeddings reveals the constituent sources. The system extracts clean speech from single-channel mixtures of unknown speakers, with a remarkable 10 dB improvement in signal-to-noise ratio, a level of improvement previously unobtainable even in simpler speech enhancement tasks. Remarkably, the system can even generalize between two- and three-speaker mixtures. We believe this technology brings us to the verge of solving the general audio separation problem, opening up a new era in spontaneous human-machine communication.
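To make the "embed, then cluster" idea concrete, here is a minimal toy sketch of the separation step, not the authors' implementation: given per-bin embedding vectors (here over a time-frequency grid, produced by some trained network), we cluster them with a small hand-rolled k-means and use the cluster labels as binary masks on the mixture magnitudes. The function names `kmeans` and `separate` and all shapes are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Toy k-means: X is (n_points, dim); returns a cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def separate(embeddings, mixture_mag, n_sources=2):
    """Cluster per-bin embeddings and apply the resulting binary masks.

    embeddings:  (time, freq, embed_dim) array, one vector per T-F bin
                 (in deep clustering these come from the trained network;
                 here they are just an input).
    mixture_mag: (time, freq) magnitude spectrogram of the mixture.
    Returns a list of n_sources masked spectrograms that partition the mixture.
    """
    T, F, D = embeddings.shape
    labels = kmeans(embeddings.reshape(-1, D), n_sources).reshape(T, F)
    return [np.where(labels == j, mixture_mag, 0.0) for j in range(n_sources)]
```

Because the masks are binary and come from a hard partition of the bins, the masked spectrograms always sum back to the mixture; separation quality then depends entirely on how well the learned embeddings cluster by source.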
Prior to joining MERL in 2010, John spent 5 years at IBM's T.J. Watson Research Center in New York, where he led a team working on noise-robust speech recognition. He also spent a year as a visiting researcher in the speech group at Microsoft Research, after obtaining his PhD from UCSD in the area of multi-modal machine perception. He is currently working on machine learning for signal separation, speech recognition, language processing, and adaptive user interfaces.