Earlier this year, Shalini Ghosh, Director of AI Research at the Artificial Intelligence Center of Samsung Research America, joined RE•WORK at the Deep Learning Summit in San Francisco. As well as presenting on Deep Learning for Incremental Object Detection and Visual Dialog, Shalini joined us for an episode of the Women in AI Podcast, which is available to watch here. Here is a transcript of the interview:
From a young age, I was fascinated by math and the analytical sciences. I did my Bachelor's in Physics (Honors) with a minor in Mathematics. During my MS and PhD in Computer Engineering, I worked on applying Information Theory, Probability and Optimization to some core problems in fault-tolerant computing and error-correcting codes. As a Principal Scientist at SRI International, I led several multi-disciplinary projects in ML and its applications to natural language understanding, cybersecurity and dependable computing, as well as multi-modal applications like visual question answering.
In 2014 I was invited to be a Visiting Scientist at Google Research in Mountain View, where I spent more than a year applying deep learning models (specifically Google Brain models) to problems in natural language processing and dialogue systems. In 2018 I joined Samsung Research America (SRA) as the Director of AI Research, where I’m the head of the Situated AI lab. I lead a team of researchers actively working on a wide range of AI topics, including computer vision, multi-modal learning (joint learning from language, vision and speech modalities), dialogue understanding and model explainability.
In the Situated AI Lab at SRA, we are working on a number of interesting research topics in Computer Vision and Multi-modal Learning. One of our core focus areas of research is Conversational Vision, where we work on problems like detecting objects in an image and having a natural-language conversation with an end user around those objects.
Conversational Vision is a novel framework of active vision that lies at the intersection of rich research areas like Computer Vision, Natural Language Processing, Dialogue Understanding and Multi-modal Learning -- it’s a promising new area of research. Conversational Vision is also very useful for interacting with users: e.g., if a user takes a picture of a scene with a mobile phone camera and has some questions about it, Conversational Vision allows the user to have a dialogue around that image, with the underlying ML system driving the interaction.
There are fascinating research problems that arise in this context. For example, one important aspect of allowing the user to have a conversation around an image is the capability of the ML model to identify objects in the image (e.g., the picture has a “horse” pulling a “cart”). In some cases, the model may encounter objects that it does not recognize. For example, suppose the ML model is trained to recognize fruits (e.g., apples, oranges, berries), and the user travels to a tropical country and takes a picture of java apples -- the model may not initially be able to identify this new fruit, since the object detector was not trained on similar data. However, we are developing a capability where the user can interact with the model and teach it that the new fruit is a “java apple” -- the object detector will learn about this new object in real time and will be able to recognize it in images in the future. This is a very useful feature for end users, since new objects of interest appear all the time in a real user environment.
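The interview doesn't specify how the detector incorporates a user-taught object, but one classical way to add a new class on the fly, without retraining the whole network, is a nearest-prototype (few-shot) classifier over embedding vectors. The sketch below uses a hypothetical `PrototypeClassifier` and made-up 2-D embeddings purely for illustration:

```python
import numpy as np

class PrototypeClassifier:
    """Toy nearest-prototype classifier: each class is represented by the
    mean of its example embeddings, and new classes can be added at any
    time without retraining existing ones."""
    def __init__(self):
        self.protos = {}  # class name -> prototype vector

    def teach(self, name, embeddings):
        # one or more example embeddings are enough to register a class
        self.protos[name] = np.mean(embeddings, axis=0)

    def predict(self, embedding):
        # classify by nearest prototype in embedding space
        return min(self.protos,
                   key=lambda n: np.linalg.norm(embedding - self.protos[n]))

clf = PrototypeClassifier()
clf.teach("apple", np.array([[1.0, 0.0], [0.9, 0.1]]))
clf.teach("orange", np.array([[0.0, 1.0], [0.1, 0.9]]))
# the user now teaches the previously unknown fruit with a single example
clf.teach("java apple", np.array([[1.0, 1.0]]))
```

Teaching “java apple” is then just one more `teach` call, and the class is immediately available at prediction time.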
Another interesting problem in this context that we’re working on is Visual Dialogue, where we train an ML model to automatically respond to user queries in the context of an image -- this allows the user to have a conversation with the ML model around the image. Finally, we are also building explanation capabilities wherein an ML model will be able to explain why it made certain predictions (e.g., product recommendations) to the end user -- this is another feature that makes Conversational Vision user-friendly.
Object detection is a core computer vision task, where an ML model is trained to identify objects from a pre-specified set of object categories. In a real-life scenario, e.g., when an object detector is used to process pictures taken by a mobile phone camera, not all object categories are known to the ML model in advance, since new objects of interest appear constantly in a user environment. As a result, it is important for object detection models to learn continually -- they need to learn how to recognize new objects without suffering from the phenomenon of catastrophic forgetting, where the ML model forgets about old objects while learning about new ones.
In this talk, we discuss a new technology we have developed that can effectively do incremental learning for object detection in near-real time. We discuss the underlying mathematical framework of a novel loss function that enabled us to achieve state-of-the-art performance on benchmark datasets. We will also outline our efficient training and inference framework, which enabled our prototype system to successfully recognize objects in a real-world live demo scenario. We also discuss extensions of our incremental object detection work, where we can use auxiliary unlabeled data to get better models or use AutoML methods to automatically learn the best neural network architecture in the continuous learning mode.
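The specific loss function from the talk isn't spelled out in this transcript, but a common ingredient in incremental-learning losses from the literature is a knowledge-distillation term that keeps the updated model's outputs on old classes close to those of the frozen previous model, which counteracts catastrophic forgetting. A minimal numerical sketch (the function name `incremental_loss` and the weighting are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def incremental_loss(new_logits, old_logits, label, num_old, T=2.0, lam=0.5):
    """Cross-entropy on the current task plus a distillation term that
    keeps the new model's predictions on the old classes close to the
    frozen old model's (softened by temperature T)."""
    p = softmax(new_logits)
    ce = -np.log(p[label] + 1e-12)
    # distillation: cross-entropy between old-model and new-model
    # softened distributions over the old classes only
    q_old = softmax(old_logits[:num_old], T)
    q_new = softmax(new_logits[:num_old], T)
    distill = -(q_old * np.log(q_new + 1e-12)).sum()
    return ce + lam * distill
```

The further the new model drifts from the old model on the old classes, the larger the distillation penalty, so training is pulled toward solutions that learn the new class while preserving old behavior.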
We next give a brief overview of a novel recurrent neural network model with attention that we have developed for the task of Visual Dialogue, where the user initiates a dialogue with the system regarding a picture. We conclude by discussing how incremental object detection, improved visual dialogue, and other novel research contributions form the cornerstones of a new research framework of Conversational Vision.
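The internals of the recurrent attention model aren't given here, but the core attention step shared by most visual-dialogue architectures is generic: score each image-region feature against the encoded question, normalize the scores with a softmax, and pool the regions into a single context vector. A minimal sketch with made-up region features:

```python
import numpy as np

def attend(question_vec, region_feats):
    """One attention step: weight image-region features by their
    (scaled) dot-product similarity to the encoded question, then
    return the attention-pooled context vector and the weights."""
    scores = region_feats @ question_vec / np.sqrt(question_vec.size)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over regions
    context = weights @ region_feats  # weighted pooling
    return context, weights
```

In a full visual-dialogue model, a recurrent network would encode the question and dialogue history into `question_vec`, and the pooled `context` would condition the answer decoder.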
Our team is working on a number of research projects, which could potentially have interesting applications in the real world, especially in the discipline of Computer Vision. One problem, which I outlined earlier, is Incremental Object Detection -- that’s useful for the real-world problem of identifying new objects using a mobile phone camera. Our model is also capable of “lifelong learning”, i.e., the model can keep learning about new objects that the user wants it to identify in a continuous fashion. Another problem that I discussed today was Visual Dialogue -- this is very useful as an application where the user interactively discovers information about items in a picture, potentially taken using a photo app. Visual Dialogue can also be used to assist visually challenged users, describing a visual scene to them and helping them discover useful information via an interactive dialogue session.
When ML algorithms are used in mission-critical domains (e.g., self-driving cars, cybersecurity) or life-critical domains (e.g., surgical robotics), it is often important to ensure that the learned models satisfy some high-level correctness requirements -- these requirements can be instantiated in particular domains via constraints like safety, e.g., “a robot arm should not come within 1 meter of a human operator during any phase of performing an autonomous operation”. An example of a safety property from self-driving cars could be “when driving along a straight road, the autonomous controller for a self-driving car should not cross a double-yellow line”. Another domain where such safety constraints can be valuable is medical transcription, which faces ambiguity when converting dictation to text -- domain knowledge about safe drug prescription boundaries can mitigate erroneous transcriptions that would indicate an unhealthily high dosage for a medication.
At SRI, we worked on a new ML technique called Trusted Machine Learning (TML), where we not only train ML models on data but also constrain them to satisfy such safety properties. We applied the principle of TML to different types of ML models, e.g., probabilistic models like Markov Decision Processes that are used in many controllers, as well as Deep Neural Network models. We plan to extend the TML principle to other types of ML models, and also to evaluate it on different domains. TML and similar approaches that enforce critical safety properties of the application domain can make ML models much more trustworthy.
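TML as described constrains models during training; as a simpler illustration of the kind of safety property involved, the sketch below enforces the 1-meter rule from the robot-arm example as a runtime check that projects an unsafe planned position back onto the boundary of the safe region (the function name `shield` and the geometry are assumptions for illustration, not the TML formulation):

```python
import numpy as np

def shield(planned_pos, human_pos, min_dist=1.0):
    """Runtime safety shield: if the planned arm position violates the
    minimum-distance property, project it radially back onto the
    boundary of the safe region (>= min_dist from the human)."""
    delta = planned_pos - human_pos
    d = np.linalg.norm(delta)
    if d >= min_dist:
        return planned_pos  # already satisfies the safety property
    if d == 0:
        # degenerate case: pick an arbitrary direction to move away in
        delta = np.array([1.0] + [0.0] * (len(planned_pos) - 1))
        d = 1.0
    return human_pos + delta / d * min_dist
```

A training-time approach like TML would instead bake such a property into the learning objective or model structure, so the learned controller never proposes the unsafe position in the first place; the shield above only illustrates the property itself.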
AI/ML is having an astonishing impact across a wide spectrum of industrial applications. It has already made a significant impact in web and personal computing applications. In the coming years, we hope to see a wider impact of AI in personal robotics and devices (e.g., home devices). We should also see a deep impact of AI in other important areas like automobiles (self-driving cars), medicine (personalized medicine, medical robotics), etc. In all, I think AI will continue to have a wider and deeper impact on different aspects of society, improving our quality of life significantly.
Our research team is actively continuing research in Conversational Vision. We also plan to expand the scope of our work into areas like Trusted Machine Learning and explore applications to Robotics. My website is shalinighosh.com -- that’s where you will find up-to-date information about me and my team’s work.
Watch more video interviews and presentations from the Deep Learning Summit here.