Towards Datacenter-Scale Deep Learning with Efficient Networking
With the rapid growth of model complexity and data volume, deep learning systems require more and more servers to perform parallel training. Currently, deep learning systems with multiple servers and multiple GPUs are usually implemented in a single cluster, which typically employs an InfiniBand fabric to support Remote Direct Memory Access (RDMA) and thereby achieve high throughput and low latency for inter-server transmission. With ever-larger models and data, deep learning systems are expected to scale across multiple network clusters, which necessitates a highly efficient inter-cluster networking stack with RDMA support. Since InfiniBand is only suited to small-scale clusters of fewer than thousands of servers, we believe RDMA over Converged Ethernet (RoCE) is a more appropriate networking technology for multi-cluster, datacenter-scale deep learning. We therefore endeavor to incorporate RoCE as the networking technology for deep learning systems such as TensorFlow and Tencent's Amber. In this talk, I will give an overview of the technical challenges and present our progress towards datacenter-scale deep learning.
Chen Li is the co-founder of Red Bird Technology and is currently a Ph.D. candidate in CS at HKUST, where he works on networking and parallel computing with Prof. Kai Chen. He is a Microsoft Research Asia PhD Fellow and has published 10+ peer-reviewed papers in top journals and conferences. His network acceleration subsystems and scheduling algorithms have been deployed in Big Data systems at Huawei and Tencent.