Baidu's Adam Coates on Deep Speech 2, New Open Source AI Tool & More

By Sophie Curtis on January 14, 2016

Named one of MIT Technology Review's 35 Innovators Under 35 for 2015, Adam Coates is Director of the Baidu Silicon Valley AI Lab. Previously a post-doctoral researcher at Stanford, over the past six years Adam has been a leading researcher and advocate of deep learning, and was one of the early supporters of using high-performance computing (HPC) techniques in the field. We spoke with him to hear more about his role at Baidu, today's release of Warp-CTC, his lab in Silicon Valley and more.

Today the Baidu Silicon Valley AI Lab released Warp-CTC, your first open source offering to the community - please tell us more about that.
We invented Warp-CTC to improve scalability of models trained using CTC while we were building our Deep Speech end-to-end speech recognition system. As there are no similar tools available, we decided to share it with the community. It's a really useful tool that complements existing AI frameworks. A lot of open source software for deep learning exists, but previous code for training end-to-end networks for sequences has been too slow. Our investment in Warp-CTC is a testament to our very strong belief in the power of deep learning combined with high-performance computing technologies.
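The CTC (Connectionist Temporal Classification) loss that Warp-CTC accelerates scores a target transcript against per-frame network outputs by summing over every valid alignment, using a dynamic-programming "forward" recursion over the label sequence with blanks interleaved. As a rough illustration of what the loss computes (this is a minimal pure-Python sketch, not Warp-CTC's optimized GPU implementation), the recursion looks like:

```python
import math

NEG_INF = float("-inf")

def _logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) in log space."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` given per-frame label
    distributions, via the CTC forward (alpha) recursion.

    log_probs: T x V nested list of log-softmax outputs
               (T frames, V labels including the blank).
    target:    list of label indices, without blanks.
    """
    T = len(log_probs)
    # Extended sequence with blanks interleaved: ^ a ^ b ^ ...
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S = len(ext)

    alpha = [[NEG_INF] * S for _ in range(T)]
    alpha[0][0] = log_probs[0][blank]
    if S > 1:
        alpha[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                      # stay on same symbol
            if s > 0:
                a = _logsumexp(a, alpha[t - 1][s - 1])  # advance one symbol
            # Skipping a blank is allowed only between distinct labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = _logsumexp(a, alpha[t - 1][s - 2])
            alpha[t][s] = a + log_probs[t][ext[s]]

    if S == 1:
        return -alpha[T - 1][0]
    # Valid paths end on the last label or the trailing blank.
    return -_logsumexp(alpha[T - 1][S - 1], alpha[T - 1][S - 2])
```

Warp-CTC's contribution is making this computation (and its gradient) fast enough on GPUs to train end-to-end sequence models at scale; the recursion itself is standard CTC.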

Tell us about your role at Baidu, your lab and the recent development of Deep Speech.
I'm the director of Baidu's Silicon Valley AI Lab where our mission is to build AI technologies that will have a significant impact on hundreds of millions of people. Our Deep Speech system has that scale of potential. We started working on Deep Speech in late 2014 and at the NIPS conference in Montreal in December 2015, we announced that Deep Speech 2 can now accurately recognize both English and Mandarin speech with a single learning algorithm. The Mandarin version is very accurate in many scenarios and is ready to be deployed on a large scale in real-world applications, such as web searches on mobile devices.

The Silicon Valley AI Lab, which we refer to as "SVAIL," includes multiple research teams that work closely together. In addition to a great deep learning team working on the neural networks for Deep Speech, we also have a fantastic high-performance computing team that has created the systems that let us train all of the networks. There's a production team helping us figure out how to deliver it to users, and a product team that's exploring new ways to use the technology. This "end-to-end" model of research is exciting, and it's helped us surmount a number of hurdles that we faced along the way.

What challenges did you meet and overcome in the development of Deep Speech?
Initially, there was some skepticism outside of Baidu that something like Deep Speech could catch up to existing speech systems. There were concerns that we couldn't get enough data, that the models would be too big to deploy, or that the things we developed for English wouldn't generalize well to Mandarin to serve Baidu's user base. Each of these was a real challenge—it was a lot of work to scale up our datasets and computational capability. We had to discover the best types of models and how to balance the needs of deployment against the need for high accuracy. But Deep Speech is advancing quickly and the team has put the initial concerns in the rearview mirror.

Can you tell us more about the Batch Dispatch process used to deploy DNNs?
One of the challenging parts about deploying neural networks is the computational cost—big neural networks run best on GPUs where you can do a lot of arithmetic quickly and efficiently. The challenge, though, is that GPUs are at their best when operating on many pieces of data in parallel rather than serving single requests. In short, GPUs are built for throughput and not for latency—but latency matters a lot to users. To solve this, our team developed a way to combine the requests of many users into batches so that the GPU can serve them all in parallel without compromising too much on latency.
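The idea described above can be sketched in a few lines: incoming requests wait briefly in a queue, a worker drains whatever has arrived (up to a batch-size cap or a latency deadline), and a single batched call serves them all. This is a toy illustration of the batching pattern, not Baidu's actual Batch Dispatch code; the class name, parameters, and defaults here are all invented for the example.

```python
import queue
import threading
import time

class BatchDispatcher:
    """Illustrative request batcher: callers block on submit(), a
    background worker groups pending requests and evaluates them
    with one batched call (standing in for a GPU forward pass)."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.01):
        self.batch_fn = batch_fn      # processes a list of inputs at once
        self.max_batch = max_batch    # cap on batch size
        self.max_wait_s = max_wait_s  # latency budget for filling a batch
        self.q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, x):
        """Enqueue one request and block until its result is ready."""
        done = threading.Event()
        box = {}
        self.q.put((x, box, done))
        done.wait()
        return box["y"]

    def _worker(self):
        while True:
            batch = [self.q.get()]  # block until the first request arrives
            deadline = time.monotonic() + self.max_wait_s
            # Collect more requests until the batch fills or time runs out.
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=remaining))
                except queue.Empty:
                    break
            xs = [item[0] for item in batch]
            ys = self.batch_fn(xs)  # one batched evaluation for all callers
            for (_, box, done), y in zip(batch, ys):
                box["y"] = y
                done.set()
```

The `max_wait_s` knob captures the throughput/latency trade-off the answer describes: waiting longer yields fuller batches and better GPU utilization, at the cost of added per-request latency.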

How is Deep Speech different from other speech recognition systems?
Most existing speech systems were built on a solid engineering strategy: break the problem into pieces, solve each one really well, then put the pieces together and iterate on each one to make improvements. However, for something as complex as interpreting words in an audio signal (especially in a crowded room or when listening to someone with a heavy accent) it's very hard to get all of the pieces to fit "just right", and so building or improving upon an existing state-of-the-art speech system requires a lot of experience and specialization. Deep Speech takes a different approach: it replaces all of these components with a large neural network that learns how to make everything "just right" automatically by learning from huge amounts of data. That makes it simpler and more flexible and it gets better as we train it on more data.

What industries do you see speech recognition having the biggest impact on?
In the short term, mobile devices are crying out for better speech systems. This is especially true in developing economies where many users connect to the internet for the first time with a mobile device or prefer using their mobile devices to using, say, a laptop. These users put much higher demands on their cellphones than I put on mine and I think Deep Speech can have a big impact there. Longer term, speech interfaces will be very pervasive. As we ask technology to help us in more complex scenarios, AI will need to make technology more understandable and useful. Speech will be a big part of that.

What developments can we expect to see in deep learning in the next five years?
We saw deep learning go from being "promising" to "dominant" for image recognition in only a couple years, so five years is a very long time in this field. I think Deep Speech presages a similar change in speech recognition. A little further out, deep learning is going to have a big impact in natural language understanding. I'm especially excited about this because it will be an important part of making technology easy to interact with.

Are there any technologies you’re excited about becoming commonplace in our daily lives? When do you think these will be available?
We're already carrying the hardware around with us: microphones, cameras, and a powerful processor. We have mobile devices in our pockets all day but I don't think we're really tapping the potential of what they can do yet. Transforming how we use mobile devices using the visual and audio capabilities they already have requires better AI systems---that's part of what SVAIL is aiming to achieve. The SVAIL team is focused on making it as easy to interact with a computer as it is to interact with a person.

To learn more about deep learning at Baidu, join us at the RE•WORK Deep Learning Summit, in San Francisco on 28-29 January. Andrew Ng, Chief Scientist at Baidu, will be speaking at the summit alongside experts from Google, Twitter, MIT, OpenAI, Enlitic and more.

The Deep Learning Summit will be taking place alongside the Virtual Assistant Summit. Tickets are now limited. For more information and to register, please visit the event page.


