Accelerating Distributed Model Training in a Kubernetes Cluster
Kubernetes offers a compelling way to orchestrate deep learning workloads that scale across multiple GPUs and nodes. With Kubernetes and Kubeflow PyTorchJob, we can easily schedule and track distributed training jobs in single-GPU multi-node, multi-GPU single-node, and multi-GPU multi-node configurations on a shared GPU resource pool. To accelerate deep learning training at Zoom, we enable RDMA over Converged Ethernet (RoCE), bypassing the kernel TCP/IP stack and offloading network transport from the CPU to the network adapter. We apply this technology in Kubernetes using SR-IOV, deployed through the NVIDIA Network Operator, in a heterogeneous GPU cluster of 4-GPU and 8-GPU servers. By combining NVIDIA NCCL, Apex, and PyTorch PowerSGD, we achieve near-linear performance scaling as the number of GPUs and worker nodes increases.
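The three job topologies mentioned above can be illustrated with a minimal sketch that builds a Kubeflow PyTorchJob manifest as a plain Python dictionary. This is not the manifest used at Zoom: the helper names `build_pytorchjob` and `total_gpus`, the job name, the container image, and the training command are all hypothetical, and the RDMA/SR-IOV network attachment is left out because it depends on how the cluster is configured.

```python
import json


def build_pytorchjob(name, image, workers, gpus_per_pod, command):
    """Return a Kubeflow PyTorchJob manifest as a plain dict (illustrative helper).

    The job has one master pod plus `workers` worker pods, and every pod
    requests `gpus_per_pod` GPUs through the `nvidia.com/gpu` resource.
    """
    def replica(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [{
                        "name": "pytorch",  # container name used by PyTorchJob
                        "image": image,     # hypothetical image
                        "command": list(command),
                        # Ask NCCL to log which transport it selects.
                        "env": [{"name": "NCCL_DEBUG", "value": "INFO"}],
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                    }]
                }
            },
        }

    specs = {"Master": replica(1)}
    if workers > 0:
        specs["Worker"] = replica(workers)
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": specs},
    }


def total_gpus(job):
    """Sum the GPUs requested across all replicas of the job."""
    total = 0
    for spec in job["spec"]["pytorchReplicaSpecs"].values():
        container = spec["template"]["spec"]["containers"][0]
        total += spec["replicas"] * container["resources"]["limits"]["nvidia.com/gpu"]
    return total


# Multi-GPU multi-node: 1 master + 3 workers, 8 GPUs per pod.
job = build_pytorchjob("ddp-demo", "example.com/train:latest", 3, 8,
                       ["python", "train.py"])
print(json.dumps(job, indent=2))
```

In this sketch, single-GPU multi-node corresponds to `gpus_per_pod=1` with several workers, and multi-GPU single-node corresponds to `workers=0` with several GPUs on the master pod. The printed JSON could be saved to a file and submitted with `kubectl apply -f`, assuming the Kubeflow training operator is installed in the cluster.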
Jack Jin is a lead machine learning infrastructure engineer at Zoom AI/ML. He designed and built an end-to-end, Kubernetes-based ML platform that serves multiple Zoom ML teams on a shared GPU resource pool. The platform runs distributed model training, such as PyTorch DDP with Kubeflow PyTorchJob, and uses RDMA, RoCE, and SR-IOV to accelerate multi-GPU, multi-node training. He also designed and built the systems and infrastructure for data ETL, data exploration, big data processing, and ML exploration. Before joining Zoom, Jack was MLOps Cloud Lead at Roche Genentech and a cloud consultant at IBM/Taos, and was involved in building an ML platform serving 500 Roche data scientists globally.