Taming Signals, Features, and Training Datasets: How data management shaped Pinterest ML
This talk examines how Pinterest evolved its handling of three types of data -- raw signals, ML features, and ML training datasets -- and how those changes affected ML practitioners at Pinterest. Data management is the core complexity of production ML engineering, especially in web-scale applications with billions of entities and training examples. Pinterest “signals,” or raw data about Pins, boards, and other entities, started as monolithic datasets that grew unmaintainable. We split them into individually owned datasets on a standardized “Signal Platform,” improving governance around lineage, ownership, and monitoring. We moved ML features from highly custom formats to a flat, standardized “Unified Feature Representation,” enabling a shared feature store and model inference. Finally, we are transitioning ML training datasets from ad-hoc row-oriented datasets to standardized columnar table groups, which should improve storage efficiency and enable shared training pipelines in the future.
David is the Head of ML Platform at Pinterest, an organization comprising the ML Data, ML Training, and ML Serving teams. These teams provide infrastructure for 200 engineers and data scientists for applications spanning ads, recommendations, search, and trust/safety, handling billions of events per day. Previously at Pinterest, David also started the Related Pins recommendations and visual search teams and built one of the first ML-based recommender systems at Pinterest. He holds bachelor's and master's degrees in computer science from Stanford.