Multimodal Learning with Deep Boltzmann Machines
Real-world data often consist of multiple modalities: images are frequently accompanied by captions and tags, videos contain both visual and auditory signals, and robots receive input from visual, auditory, and touch sensors. I will talk about a deep learning model that extracts a unified representation fusing the multiple modalities together. The model is robust to missing data and can fill in a missing modality based on whatever is available. Our experiments on bi-modal image-text data show that the model can generate relevant words given an image, as well as retrieve images given text queries.
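The fill-in-the-missing-modality idea can be illustrated with a toy sketch. This is not the actual multimodal Deep Boltzmann Machine from the talk (which uses multiple hidden layers with Gaussian and replicated-softmax visible units); it is a single binary RBM over concatenated "image" and "text" bits, trained with one-step contrastive divergence on synthetic correlated data. Given only the image half, clamped Gibbs sampling infers the text half. All variable names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: 6 "image" bits and 6 "text" bits that mirror each other,
# standing in for two correlated modalities.
n_img, n_txt, n_hid = 6, 6, 8
n_vis = n_img + n_txt
patterns = rng.integers(0, 2, size=(4, n_img))
data = np.hstack([patterns, patterns]).astype(float)
data = np.repeat(data, 50, axis=0)  # 200 training cases

W = 0.1 * rng.standard_normal((n_vis, n_hid))
b_v = np.zeros(n_vis)
b_h = np.zeros(n_hid)

# CD-1 training: one Gibbs step from the data to get negative statistics.
lr = 0.1
for epoch in range(200):
    v0 = data
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(data)
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)

def fill_in_text(img_bits, n_gibbs=50):
    """Clamp the image units and Gibbs-sample the missing text units."""
    v = np.concatenate([img_bits, 0.5 * np.ones(n_txt)])
    for _ in range(n_gibbs):
        ph = sigmoid(v @ W + b_h)
        h = (rng.random(n_hid) < ph).astype(float)
        pv = sigmoid(h @ W.T + b_v)
        v[n_img:] = pv[n_img:]  # update only the text block; image stays clamped
    return (v[n_img:] > 0.5).astype(int)

predicted_text = fill_in_text(patterns[0].astype(float))
```

The same conditional-inference trick runs in both directions: clamping the text units and sampling the image units gives a (feature-space) image representation for a text query, which is the basis of cross-modal retrieval.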
Nitish Srivastava is a PhD student in the Machine Learning group at the University of Toronto, working with Geoffrey Hinton and Russ Salakhutdinov. He is interested in using machine learning to create representations for images and videos that can help solve computer vision problems, and is currently working on object detection and action recognition. He is also interested in combining multiple data modalities into joint representations that can be used for cross-modal information retrieval, and has worked on a new regularization technique that makes it possible to train very large, deep neural networks without overfitting.