Do Transformers See Like Convolutional Neural Networks?
Deep learning capabilities have arguably entered a new stage since around 2020. These new capabilities rely on new design principles, with self-supervision, large-scale pretraining, and transfer learning all playing central roles. Perhaps most strikingly, this new stage of deep learning makes critical use of Transformers. Initially developed for machine translation and then rapidly adopted across other NLP tasks, Transformers have recently shown superior performance to Convolutional Neural Networks (CNNs) on a variety of computer vision tasks. As CNNs have been the de facto model for visual data for almost a decade, this gives rise to fundamental questions: how are Transformers solving visual tasks, and what can we learn from their successful application? In this talk, I present results that provide answers to these questions. I give an overview of key differences between image representations in Transformers and CNNs, and show how these arise from the differing functions of self-attention versus convolutions. I highlight connections between pretraining and learned representations, and explore the ramifications for transfer learning and the role of scale in performance.
Maithra Raghu is a Senior Research Scientist at Google Brain and completed her PhD in Computer Science at Cornell University. Her research broadly focuses on enabling effective collaboration between humans and AI, from design to deployment. Specifically, her work develops algorithms to gain insights into deep neural network representations, and uses these insights to inform the design of AI systems and their interaction with human experts at deployment. Her work has been featured in many press outlets, including The Washington Post, WIRED, and Quanta Magazine. She has been named one of the Forbes 30 Under 30 in Science, a 2020 STAT Wunderkind, and a Rising Star in EECS.