Understanding Categorical Embedding in Deep Learning
In insurance, structured data provides valuable information about the problems we want to solve, and much of this information is captured as categorical variables. Transforming categorical variables into the right format for Machine Learning (ML) models is therefore an essential part of the ML pipeline. Common approaches such as one-hot encoding and mean response encoding can work for particular types of models. However, transforming categorical data can be challenging and time-consuming, and may result in loss of information and sparse inputs to Neural Network (NN) models. At LV=, we are building on the success of word embeddings, which capture semantic and syntactic relationships between words, and are exploring categorical embeddings as a way to capture intrinsic properties of categorical data and improve model performance. In this presentation, we will also explore different ways of understanding the embedding space.
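To make the contrast concrete, here is a minimal sketch of the two encodings the abstract mentions: a sparse one-hot vector versus a dense embedding lookup. The category names and the embedding dimension are illustrative assumptions, not from the talk, and the random weights stand in for values that a real NN would learn by backpropagation.

```python
import numpy as np

# Hypothetical categorical feature: vehicle body type with 5 levels.
categories = ["hatchback", "saloon", "estate", "suv", "coupe"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(cat):
    """One-hot encoding: a sparse vector with a single 1."""
    v = np.zeros(len(categories))
    v[index[cat]] = 1.0
    return v

# Categorical embedding: a dense lookup table (5 categories x 2 dims).
# In a trained model these weights are learned; random values are a
# stand-in here purely for illustration.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(categories), 2))

def embed(cat):
    """Embedding lookup: a dense, low-dimensional vector."""
    return embedding_table[index[cat]]

print(one_hot("suv"))  # mostly zeros, length grows with cardinality
print(embed("suv"))    # dense 2-d vector, size fixed by design
```

The one-hot representation grows with the number of category levels and carries no notion of similarity, whereas the learned embedding keeps a fixed, small dimension in which related categories can end up close together.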
Dapeng Wang is a Senior Data Scientist at the insurance company LV=. He graduated in maths from the University of Cambridge and holds an MSc from the University of Sussex. At LV=, Dapeng is leading the adoption of Deep Learning across the company and is currently developing the end-to-end pipeline to build and integrate Deep Learning within current LV= processes. Dapeng is also a frequent Kaggle competitor and a Kaggle Competitions Expert. He looks forward to using his experience to help the deep learning community find suitable and better implementation solutions for deep learning.