Reinforcement Learning Meets Sequence Prediction
Neural sequence-to-sequence models have seen remarkable success across a range of tasks, including machine translation and speech recognition. I will begin with an overview of the dominant approach to supervised sequence learning with neural networks. I will then present optimal completion distillation (OCD) -- a new approach for training sequence models on their own mistakes. Given a partial sequence generated by the model, OCD identifies the set of optimal suffixes and, accordingly, teaches the model to optimally extend each prefix. OCD achieves state-of-the-art performance on end-to-end speech recognition on standard benchmarks.

In the second half of the talk, I will focus on sequence modeling tasks that involve discovering latent programs as part of the optimization. I will present our approach, memory augmented policy optimization (MAPO), which improves upon REINFORCE by expressing the expected-return objective as a weighted sum of two terms: an expectation over a memory of high-reward trajectories, and a separate expectation over the trajectories outside the memory. MAPO achieves state-of-the-art results on standard semantic parsing datasets.
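To make the OCD idea concrete, here is a minimal sketch (my own illustration, not code from the talk) of the key subroutine the abstract alludes to: finding the tokens that optimally extend a partial hypothesis, measured by the edit distance of the best achievable completion against the ground-truth target. A standard dynamic-programming result is that these tokens can be read off the edit distances between the prefix and every prefix of the target. Function names are hypothetical.

```python
def edit_distances_to_prefixes(prefix, target):
    """Return d where d[j] = edit distance between `prefix` and target[:j]."""
    # Row for the empty prefix: distance to target[:j] is j insertions.
    prev = list(range(len(target) + 1))
    for ch in prefix:
        cur = [prev[0] + 1]  # distance of a longer prefix to the empty target
        for j in range(1, len(target) + 1):
            cur.append(min(
                prev[j] + 1,                          # delete ch
                cur[j - 1] + 1,                       # insert target[j-1]
                prev[j - 1] + (ch != target[j - 1]),  # match / substitute
            ))
        prev = cur
    return prev

def optimal_extensions(prefix, target, eos="</s>"):
    """Tokens that extend `prefix` toward the lowest-edit-distance completion of `target`."""
    d = edit_distances_to_prefixes(prefix, target)
    best = min(d)
    # If target[:j] is a cheapest-to-reach prefix, continuing with target[j]
    # (or EOS, when j == len(target)) is an optimal next step.
    return {target[j] if j < len(target) else eos
            for j, dist in enumerate(d) if dist == best}
```

For example, with target "cat", an empty hypothesis should be extended with "c", while a hypothesis that already made the mistake "x" can optimally continue with either "c" or "a" -- OCD trains the model toward this full set rather than blindly forcing the ground-truth token.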
Mohammad Norouzi is a senior research scientist at Google Brain in Toronto. His research lies at the intersection of deep learning, natural language processing, and computer vision, with a current focus on learning statistical models of sequential data and advancing reinforcement learning algorithms and applications. He earned his PhD in computer science at the University of Toronto under the supervision of Prof. David Fleet, working on scalable similarity search algorithms. He was a recipient of the prestigious Google US/Canada PhD Fellowship in Machine Learning.