An Application of Gradient Boosted Decision Trees (GBDT) for Query Runtime Prediction
In 2016, Uber went through exponential growth. To support this unprecedented growth in business, Uber had to rapidly scale out its infrastructure. Growth in the core business and in the number of products offered by Uber, coupled with an increase in the number of data scientists and analysts, put an increasing load on our data systems. In particular, a stable and efficient data warehouse was a fundamental requirement for sustaining the pace at which Uber was growing and operating in the marketplace. Data-driven decisions at Uber typically begin with fetching data through our internal web-based tool, Querybuilder. More than 150K SQL queries are issued against our data warehouse through Querybuilder every week. To manage this load efficiently, it is important to accurately predict the runtime of a query before it hits the warehouse. In this study, we share how we used a Gradient Boosted Decision Tree model to predict the runtime class of a query (short or long). The predicted label was then used to route each incoming query, in real time, into an appropriate queue for execution. Routing queries based on the classifier's predictions increased the overall efficiency of our warehouse: we observed a significant decrease in the average wait time before query execution and an increase in system throughput under peak load.
Abi Komma is a Senior Data Scientist in the Applied Machine Learning group at Uber. He focuses on applying regression, classification, and survival analysis techniques to build predictive models for various business problems at Uber. He holds a Master's degree in Transportation Engineering from the University of Florida and a Bachelor's degree in Civil Engineering from the Indian Institute of Technology (IIT) Madras.