The most widely used exploration methods in reinforcement learning today (such as entropy regularization and epsilon-greedy) have changed little in the last 20 years. Google Brain argues that these exploration strategies are naive and misguided in large action spaces, and presents UREX, a policy gradient algorithm that explores more in areas of high reward. They motivate UREX mathematically by showing that its objective combines expected reward with a mean-seeking Kullback–Leibler (KL) divergence from the "expert" policy. Moreover, they show that UREX empirically outperforms standard methods on a suite of algorithmic tasks.
Ofir Nachum works at Google Brain as a Research Resident. His research focuses on sequence-to-sequence models and reinforcement learning. We interviewed Ofir ahead of his session at the Deep Learning Summit in San Francisco, 26-27 January, to find out what started his work in deep learning, what UREX is, the main challenges being addressed in the deep learning space, and his prediction for deep learning in the next 5 years.
What started your work in deep learning?
While working at Quora, I was first seriously introduced to machine learning as part of our recommendations system. It was clear that smart machine learning methods had a huge impact on metrics, so I tried to read and learn more about them. This was around the time that deep learning really started to take off, so I was naturally pushed to learn about deep learning methods. Through reading about how others used deep learning to solve previously intractable problems, and trying it out myself, I became convinced that this paradigm would be the key to important advances in the near future.
Can you provide a brief overview of UREX?
UREX is a policy gradient method for solving reinforcement learning (RL) problems. Specifically we take the basic goal of maximizing expected reward and try to augment that with some sort of exploration. The usual form of exploration in other methods is mostly random. Our formulation, on the other hand, prioritizes exploration based on reward. That is, it explores more in areas that have high reward.
UREX does so by making a correspondence between an action's log-probability under the policy and the resulting value (reward) of that action. In such a correspondence, an action should be explored more if its log-probability underestimates the resulting reward. Thus one can see UREX as trying to unify policy and value approaches to RL.
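The idea above can be sketched in a few lines of numpy. This is a minimal illustration (not Google Brain's implementation): for a batch of K sampled actions, UREX-style exploration weights are computed by a softmax over the gap between the temperature-scaled reward and the policy's log-probability, so under-appreciated actions — those whose log-probability underestimates their reward — receive larger weights. The function name and temperature value are illustrative.

```python
import numpy as np

def urex_weights(rewards, log_probs, tau=0.1):
    """Self-normalized exploration weights over K sampled actions.

    An action whose log-probability under the policy underestimates its
    temperature-scaled reward gets a larger weight, so the gradient pushes
    more probability mass toward under-appreciated, high-reward actions.
    """
    scores = np.asarray(rewards, dtype=float) / tau - np.asarray(log_probs, dtype=float)
    scores -= scores.max()          # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()              # normalize so the weights sum to 1

# Example: two actions with equal reward; the one the policy currently
# assigns low probability (0.1) is "under-appreciated" and gets the
# larger weight.
w = urex_weights(rewards=[1.0, 1.0],
                 log_probs=[np.log(0.9), np.log(0.1)],
                 tau=1.0)
# w ≈ [0.1, 0.9]
```

Note the contrast with standard REINFORCE, which would weight each sampled action by its reward alone and so treat these two actions identically.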
While this tries to provide the intuition behind UREX, there is theory behind it as well! We actually can show that our formulation is intimately related to minimizing a Kullback–Leibler (KL) divergence between the agent policy and an "optimal" policy.
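Roughly, the relationship can be written as follows (a sketch of the standard argument, with $\pi^*_\tau$ denoting the softmax-of-reward "optimal" policy at temperature $\tau$; notation here is illustrative, not copied from the paper):

```latex
% The "optimal" policy puts probability on actions in proportion to
% exponentiated reward:
\pi^*_\tau(a) \;\propto\; \exp\!\big(r(a)/\tau\big).

% Maximizing expected reward is, up to constants, equivalent to
% minimizing a mode-seeking KL divergence:
\max_\theta \; \mathbb{E}_{a \sim \pi_\theta}\!\big[r(a)\big]
\;\Longleftrightarrow\;
\min_\theta \; \mathrm{KL}\big(\pi_\theta \,\|\, \pi^*_\tau\big).

% UREX augments this with the reverse, mean-seeking direction, which
% encourages the policy to cover all high-reward actions rather than
% collapsing onto a single mode:
\min_\theta \; \mathrm{KL}\big(\pi^*_\tau \,\|\, \pi_\theta\big).
```

The mean-seeking direction is what drives exploration: it penalizes the policy for assigning low probability to actions that $\pi^*_\tau$ considers valuable.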
What are the practical applications of your work and what sectors are most likely to be affected?
RL is a field that assumes much less supervision than is currently used in real-world machine learning systems. Thus any advancement in RL means that the next generation of applied machine learning systems will require less supervision.
Outside of your own field, what area of deep learning advancements excites you most?
I'm excited about the recent advancements in generative models and interested in what ways they'll be applied to solve real-world problems.
What developments can we expect to see in deep learning in the next 5 years?
I imagine that the image and translation systems we have now will scale up to even greater accuracy and be applied in ways that significantly affect our everyday lives.
Join Ofir Nachum and other great speakers, including Shubho Sengupta, Research Scientist at Baidu; Ilya Sutskever, Research Director at OpenAI; and Stacey Svetlichnaya, Software Development Engineer at Flickr, at the Deep Learning Summit by registering here. There are only 10 tickets remaining! View the agenda here.