Revisiting Small Batch Training for Deep Neural Networks
The size of the batches used for stochastic gradient descent (or its variants) is one of the principal hyperparameters that must be considered when training deep neural networks. Often, very large batches are used to increase parallelism and achieve higher throughput on today's hardware. But what is the cost to training performance? This work investigates the effect of batch size on the training of modern deep neural networks and shows that smaller batches improve the stability of training and achieve better test performance.
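To make the hyperparameter concrete, here is a minimal sketch of mini-batch SGD on a toy linear regression problem, where batch_size controls how many samples contribute to each gradient estimate. All names (X, y, train, batch_size, lr) are illustrative and not taken from the talk; this is not the experimental setup of the work itself.

```python
import numpy as np

# Toy data: linear regression with a small amount of noise.
# (Illustrative setup only, not the experiments from the talk.)
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1024)

def train(batch_size, lr=0.05, epochs=20):
    """Plain mini-batch SGD; batch_size is the hyperparameter of interest."""
    w = np.zeros(8)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)  # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # Mean-squared-error gradient averaged over the mini-batch.
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

# Small batches take many noisy steps per epoch; large batches take
# few, smoother steps. The talk examines how this trade-off affects
# training stability and test performance in deep networks.
for bs in (8, 256):
    w = train(bs)
    print(bs, np.linalg.norm(w - true_w))
```

Both settings recover the true weights on this convex toy problem; the interesting differences the talk explores arise in the non-convex deep learning setting.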
Dominic is a Research Engineer at Graphcore, focusing on understanding and improving the fundamental learning algorithms used for deep neural network training. He completed his undergraduate degree and Master's in Mathematics, followed by a PhD applying optimization methods to aerodynamic design.