Scalable Natural Gradient Training of Deep Neural Networks
Neural networks have recently driven significant progress in machine learning applications as diverse as vision, speech, and text understanding. Despite much engineering effort to boost the computational efficiency of neural net training, most networks are still trained using variants of stochastic gradient descent. Natural gradient descent, a second-order optimization method, has the potential to speed up training by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to compute for large networks because it requires solving a linear system involving the Fisher matrix, whose dimension may be in the millions for modern neural network architectures. The key challenge is to develop approximations to the Fisher matrix which are efficiently invertible, yet accurately reflect its structure.
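In symbols, the idea described above can be summarized as follows (notation is illustrative, not taken from the talk): the natural gradient step solves a linear system in the Fisher matrix, which is why exact computation is impractical at modern network sizes.

```latex
\theta_{t+1} = \theta_t - \eta \, F^{-1} \nabla_\theta \mathcal{L}(\theta_t),
\qquad
F = \mathbb{E}\!\left[ \nabla_\theta \log p(y \mid x, \theta)\,
                       \nabla_\theta \log p(y \mid x, \theta)^{\top} \right].
```

For a network with millions of parameters, $F$ has millions of rows and columns, so forming and inverting it directly is infeasible; hence the need for structured, efficiently invertible approximations.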
The Fisher matrix is the covariance of log-likelihood derivatives with respect to the weights of the network. I will present techniques to approximate the Fisher matrix using structured probabilistic models of the computation of these derivatives. Using probabilistic modeling assumptions motivated by the structure of the computation graph and empirical analysis of the distribution over derivatives, I derive approximations to the Fisher matrix which allow for efficient approximation of the natural gradient. The resulting optimization algorithm is invariant to some common reparameterizations of neural networks, suggesting that it automatically enjoys the computational benefits of these reparameterizations. I show that this method gives significant speedups in the training of neural nets for image classification and reinforcement learning.
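As a minimal sketch of the underlying idea, the snippet below computes an exact (empirical) Fisher matrix and a damped natural-gradient step for a toy logistic-regression model, where the Fisher is small enough to invert directly. The model, data, and damping constant are hypothetical stand-ins, not the approximation scheme from the talk; the point is only to show the Fisher as the covariance of per-example log-likelihood gradients and the natural gradient as the solution of a linear system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: logistic regression stands in for a "network" small enough
# that the exact Fisher matrix (d x d) can be formed and inverted.
n, d = 256, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)


def nll(w):
    # Average negative log-likelihood (logistic loss).
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()


def per_example_grads(w):
    # Gradient of the negative log-likelihood for each example, shape (n, d).
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X


w = np.zeros(d)
loss0 = nll(w)

for _ in range(50):
    G = per_example_grads(w)
    g = G.mean(axis=0)            # average gradient
    F = (G.T @ G) / n             # empirical Fisher: second moment of per-example grads
    damping = 1e-3                # Tikhonov damping keeps the system well-conditioned
    nat_grad = np.linalg.solve(F + damping * np.eye(d), g)
    w -= 0.5 * nat_grad           # natural-gradient step

loss_final = nll(w)
```

For a real network, `F` would be far too large to store or invert, which is exactly the gap the structured probabilistic approximations in the talk are designed to close.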
Roger is an Assistant Professor of Computer Science at the University of Toronto, focusing on machine learning. Previously, he was a postdoc at Toronto, after receiving a Ph.D. at MIT under Bill Freeman and Josh Tenenbaum. Before that, he completed an undergraduate degree in symbolic systems and an MS in computer science at Stanford University. He is also a co-creator of Metacademy, a website that helps users formulate personalized learning plans for machine learning and related topics, based on a dependency graph of the core concepts. He also recently taught an undergraduate neural networks course at the University of Toronto.