UREX: Under-appreciated Reward Exploration
The most widely used exploration methods in reinforcement learning today (such as entropy regularization and epsilon-greedy) have not changed much in the last 20 years. We argue that these exploration strategies are naive and misguided in large action spaces. We present UREX, a policy gradient algorithm that directs exploration toward under-appreciated rewards: action sequences whose log-probability under the current policy underestimates their reward. We motivate UREX mathematically by showing that its objective is a combination of expected reward and a mean-seeking KL divergence with the "expert" policy. Moreover, we show empirically that UREX outperforms standard methods on a suite of algorithmic tasks.
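To make the idea of "under-appreciated rewards" concrete, here is a minimal sketch (our own illustration, not code from the talk) of how such a scheme can weight sampled trajectories using self-normalized importance weights: a trajectory whose log-probability falls short of its temperature-scaled reward gets boosted. The function name `urex_weights` and the parameter `tau` (reward temperature) are our naming assumptions.

```python
import numpy as np

def urex_weights(rewards, log_probs, tau=0.1):
    """Self-normalized weights over sampled trajectories.

    A trajectory is "under-appreciated" when its reward/tau exceeds
    its log-probability under the current policy; such trajectories
    receive larger weight, encouraging exploration toward them.
    Illustrative sketch only, assuming a softmax normalization.
    """
    rewards = np.asarray(rewards, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    scores = rewards / tau - log_probs
    scores -= scores.max()      # shift for numerical stability
    w = np.exp(scores)
    return w / w.sum()          # softmax over the sampled batch
```

For example, given two trajectories with equal reward, the one the policy currently assigns lower log-probability to receives the larger weight, which is exactly the exploration pressure the talk describes.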
Ofir Nachum currently works at Google Brain as a Research Resident. His research focuses on sequence-to-sequence models and reinforcement learning, though his interests span a much broader area of machine learning. He received his Bachelor's and Master's degrees from MIT. Before joining Google, he was an engineer at Quora, leading machine learning efforts on the feed, ranking, and quality teams.