Viswanath Sivakumar

Understanding Text on Images at Scale

Understanding text that appears on images in social media platforms is important not just for improving experiences, such as incorporating that text into screen readers for the visually impaired. It also helps keep the community safe by proactively identifying inappropriate or harmful content in a way that pure object detection or NLP systems alone cannot.

This talk describes the challenges behind building an industry-scale scene-text extraction system at Facebook that processes over 2 billion images each day. I'll cover the deep learning methods behind models that detect text in arbitrary orientations with high accuracy, and show how simple convolutional models work extremely well for recognizing text in over 50 languages. A critical aspect of the work is scaling up these models for efficient server-side inference. I'll dive into quantization methods that run neural networks with 8-bit integer weights and activations instead of 32-bit floating-point values, and the challenges involved in bridging the accuracy gap.
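To make the quantization idea concrete, here is a minimal sketch of generic affine (asymmetric) 8-bit quantization in plain Python. It is an illustration under assumptions, not the scheme used in the system described in this talk: the function names, the sample weights, and the choice of an unsigned 0 to 255 range are all hypothetical.

```python
# Illustrative sketch only: generic affine (asymmetric) 8-bit quantization.
# All names and values here are hypothetical, not taken from the talk.

def quant_params(values, qmin=0, qmax=255):
    """Pick a scale and zero point mapping the value range onto [qmin, qmax]."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Fall back to 1.0 when all values are zero, to avoid dividing by zero later.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]

def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [scale * (x - zero_point) for x in q]

weights = [-1.0, -0.25, 0.0, 0.5, 3.0]  # made-up example weights
scale, zero_point = quant_params(weights)
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)

# The rounding error per value is what gives rise to the accuracy gap.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each restored value differs from the original by at most about one quantization step (`scale`), so the error grows with the range the 256 levels have to cover. This is one reason a wide value range in a layer can make it harder to preserve accuracy after quantization.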

I’m a Researcher at Facebook AI Research working on machine learning for systems, where I’m currently exploring reinforcement learning to improve the performance of computer networks. Prior to that, I was part of the Facebook AI Applied Computer Vision Research group, where I founded and led the Rosetta project, a large-scale machine learning system for understanding text in images and videos. I also made extensive improvements to the low-level performance and efficiency of Computer Vision models in production.
