Fast Large-Scale Optimization by Unifying Stochastic Gradient & Quasi-Newton Methods
We present an algorithm for minibatch optimization that combines the computational efficiency of stochastic gradient descent (SGD) with the second-order curvature information leveraged by quasi-Newton methods. The two approaches are unified by maintaining an independent Hessian approximation for each minibatch. As in SGD, each update step requires only a single minibatch evaluation. As is typical for quasi-Newton methods, each step is scaled using an approximate inverse Hessian, and little to no adjustment of hyperparameters is required. The algorithm is made tractable in memory and computational cost, even for high-dimensional optimization problems, by storing and manipulating the quadratic approximations for each minibatch in a shared, time-evolving, low-dimensional subspace. Source code is released at http://git.io/SFO.
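To make the idea in the abstract concrete, the following is a minimal sketch, not the released implementation. It assumes each minibatch keeps its own quadratic model (last evaluation point, gradient, and a BFGS Hessian approximation), and that each step minimizes the sum of those models in closed form before re-evaluating a single minibatch. The names bfgs_update, minimize_model_sum, and sfo_sketch are hypothetical, the identity initialization of the Hessians is an assumption, and the shared low-dimensional subspace is omitted entirely, so this version stores a full d-by-d matrix per minibatch and is only practical for small problems.

```python
import numpy as np

def bfgs_update(H, s, y, eps=1e-10):
    """BFGS update of a Hessian approximation H from step s and gradient change y."""
    sy = y @ s
    if sy <= eps:  # skip updates that would break positive definiteness
        return H
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / sy

def minimize_model_sum(H_list, x_list, g_list):
    """Closed-form minimizer of sum_i [g_i.(x - x_i) + 0.5 (x - x_i)^T H_i (x - x_i)]."""
    H_total = sum(H_list)
    rhs = sum(H @ x - g for H, x, g in zip(H_list, x_list, g_list))
    return np.linalg.solve(H_total, rhs)

def sfo_sketch(grad_fns, x0, n_steps):
    """Illustrative loop: one minibatch gradient evaluation per step."""
    n, d = len(grad_fns), x0.size
    H = [np.eye(d) for _ in range(n)]      # assumed initialization: identity
    xs = [x0.copy() for _ in range(n)]     # last evaluation point per minibatch
    gs = [g(x0) for g in grad_fns]         # one initial pass over all minibatches
    x = x0.copy()
    for t in range(n_steps):
        i = t % n                          # round-robin minibatch choice (assumption)
        x = minimize_model_sum(H, xs, gs)  # quasi-Newton step on the summed models
        g_new = grad_fns[i](x)             # single minibatch evaluation, as in SGD
        H[i] = bfgs_update(H[i], x - xs[i], g_new - gs[i])
        xs[i], gs[i] = x, g_new
    return x

# Toy usage: four least-squares minibatches in five dimensions (synthetic data).
rng = np.random.default_rng(0)
A = [rng.normal(size=(10, 5)) / np.sqrt(10) for _ in range(4)]
b = [rng.normal(size=10) for _ in range(4)]
grad_fns = [lambda x, Ai=Ai, bi=bi: Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A, b)]
x_hat = sfo_sketch(grad_fns, np.zeros(5), n_steps=40)
```

Because every minibatch contributes its most recent quadratic model to each step, a single new gradient evaluation refines the estimate of the full objective rather than replacing it, which is how the sketch combines the per-step cost of SGD with curvature-scaled steps.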
Jascha is a postdoctoral scientist at Stanford University in Surya Ganguli's group. He received his PhD from UC Berkeley in 2012, working with Bruno Olshausen. Jascha's research interests include machine learning, neuroscience, statistical physics, and dynamical systems. Past projects include developing new methods to fit large-scale probabilistic models to data, using large-scale probabilistic models to capture functional connectivity in the brain, analyzing multispectral imagery from Mars, using Lie groups to capture transformations in natural video, and developing Hamiltonian Monte Carlo sampling algorithms. You can find more information at http://sohldickstein.com/.