Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs
Batch normalization has become an indispensable tool for training deep neural networks, yet it remains poorly understood. Although the normalization component of batch normalization has received the most emphasis, batch normalization also adds two trainable per-feature affine parameters: a coefficient, gamma, and a bias, beta. The impact of these oft-ignored parameters relative to the normalization component, however, remains unclear. In this talk, I will discuss recent work that aims to understand the role and expressive power of these affine parameters. To do so, we study the performance achieved when training only these parameters and freezing all others at their random initializations. Surprisingly, we find that training these parameters alone leads to high, though not state-of-the-art, performance. For example, on a sufficiently deep ResNet, training only the affine batch normalization parameters reaches 83% accuracy on CIFAR-10. Interestingly, batch normalization achieves this performance in part by naturally learning to disable around a third of the random features, without any changes to the training objective. In this way, this experiment can be viewed as characterizing the expressive power of neural networks constructed simply by shifting and rescaling random features, and it highlights the under-appreciated role of the affine parameters in batch normalization.
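To make the affine step concrete, here is a minimal NumPy sketch (not the authors' code) of batch normalization with trainable per-feature gamma and beta. It also illustrates the "disabling" mechanism mentioned above: when a feature's gamma is driven to zero, that random feature's output collapses to the constant beta, effectively removing it from the network.

```python
import numpy as np

def batchnorm_affine(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then apply the
    # trainable per-feature affine parameters gamma (scale) and beta (shift).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))   # a batch of 8 examples, 4 random features

gamma = np.ones(4)            # the only trainable parameters in this setting
beta = np.zeros(4)

# Driving a feature's gamma to zero "disables" that random feature:
# its output becomes the constant beta, regardless of the input.
gamma[2] = 0.0
out = batchnorm_affine(x, gamma, beta)
print(np.allclose(out[:, 2], beta[2]))
```

In the experiments described in the talk, gamma and beta would be the only parameters updated by gradient descent, with all convolutional weights left frozen at their random initializations.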
Ari Morcos is a Research Scientist at Facebook AI Research working on understanding the mechanisms underlying neural network computation and function, and using these insights to build machine learning systems more intelligently. In particular, Ari has worked on understanding the properties predictive of generalization, methods to compare representations across networks, the role of single units in computation, and on strategies to measure abstraction in neural network representations. Previously, he worked at DeepMind in London, and earned his PhD in Neurobiology at Harvard University, using machine learning to study the cortical dynamics underlying evidence accumulation for decision-making.