Large Scale Inference Generation Using Jenkins/Blue Ocean and GPU Cluster
For our ecommerce use cases, we have several catalogs on the scale of O(10^6) items for which we need to generate embeddings daily. The catalogs are constantly being expanded and updated. The embeddings are created from TensorFlow models trained on catalog images and/or item metadata. The TensorFlow models and/or the catalogs differ from one use case to the next, so we need a common platform for large scale inference generation. We are a very small team of 2 data scientists and 1 engineer. We have access to a GPU cluster with different types of Nvidia GPUs, and these GPU servers are used primarily for training.
Among the challenges we face are:
- Selecting idle GPU servers and matching docker containers for them
- Distributing the workload of inference generation to idle GPU servers
- Automating the integration tests for the model server and API server (model version matching the embeddings, as well as supported by the docker containers in production)
- Working with Walmart infrastructure for docker containers (security constraints, no nvidia-docker on GPU servers, etc.)
Our solution consists of a Jenkins cluster with all the GPU servers added as worker nodes. We use a parallel Blue Ocean pipeline to distribute the workload. For the various Nvidia GPUs, we maintain docker containers that support the corresponding CUDA and Nvidia driver versions. We take this approach because it makes it easy to add new GPU servers and to upgrade the docker images for existing ones. We use object storage for both models and embeddings, and store them in the same subfolders, so that the generated embeddings and the TensorFlow model that produced them are ready for deployment together.
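As a rough illustration of the two ideas above (splitting the daily workload across idle GPU workers, and keeping a model and its embeddings under one storage prefix), here is a minimal Python sketch. It is not our pipeline code: the function names, the use case and version strings, and the key layout are all hypothetical, chosen only to show the shape of the approach.

```python
from typing import Dict, List, Sequence


def shard_catalog(item_ids: Sequence[str], num_workers: int) -> List[List[str]]:
    """Split item ids into num_workers contiguous shards whose sizes differ by at most one."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    base, extra = divmod(len(item_ids), num_workers)
    shards: List[List[str]] = []
    start = 0
    for i in range(num_workers):
        # The first `extra` shards take one additional item each.
        size = base + (1 if i < extra else 0)
        shards.append(list(item_ids[start:start + size]))
        start += size
    return shards


def artifact_keys(use_case: str, model_version: str, shard_index: int) -> Dict[str, str]:
    """Build object storage keys (hypothetical layout) that share one prefix per model version."""
    prefix = f"{use_case}/{model_version}"
    return {
        "model": f"{prefix}/saved_model/",
        "embeddings": f"{prefix}/embeddings/shard-{shard_index:05d}.npy",
    }


# Example: ten catalog items spread over three idle GPU workers.
catalog = [f"item-{n}" for n in range(10)]
for index, shard in enumerate(shard_catalog(catalog, num_workers=3)):
    keys = artifact_keys("apparel", "v7", index)
    print(index, len(shard), keys["embeddings"])
```

In a setup like this, each shard would map to one parallel pipeline branch running on one worker node, and the shared prefix means a deployment only needs a single model version to locate both the model and its embeddings.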
In my talk, I will also touch on the DevOps aspects: the do’s and don’ts of building docker images (keeping the docker images lean, lessons learned from building the TensorFlow model server under enterprise constraints, etc.) and deployment to Kubernetes (using init containers, rolling updates, etc.).
Future directions we are exploring:
- TensorRT inference server
- Kubernetes on Nvidia GPUs
- Google TPU for inference
Binwei Yang is an engineer, a hacker, and a hustler. He is passionate about being a lifelong learner and about equipping youth in underserved communities with high-tech skills. Binwei graduated from the University of Southern California with a Master's in Computer Engineering and a Ph.D. in Physics, and has more than 20 years of professional experience creating massively scalable customer-facing applications. He currently works on computer vision as a principal engineer for Walmart Labs.