At the Deep Learning Summit in Boston next week, we'll be joined by Kelly Davis, Machine Learning Researcher at Mozilla, who will talk about Deep Speech, an open source speech-to-text (STT) engine that uses deep learning to democratize STT. The team at Mozilla understand the importance of speech as it brings a human dimension to our smartphones, computers and devices like Amazon Echo, Google Home and Apple HomePod. At the summit, Kelly will cover the technical details at the heart of Deep Speech as well as the data and infrastructure required for its care and feeding.
We caught up with Kelly in advance of his presentation to get an insight into what we might learn:
I lead the Machine Learning Group here at Mozilla, an amazing team of technologists working together to create open source deep learning technologies. We're very focused on collaborative approaches that will increase access to speech and other technologies.
As manager, a lot of what I do is tending to the flock. I wake up early because my focus is the best in the morning. I usually start my day reading and writing research papers, taking my dogs for a walk around the Spree River (which runs through the middle of Berlin), eating breakfast, and answering emails.
After that, there’s a myriad of things I work on, starting with one-on-one meetings with my team. I check their progress on projects and provide support with any barriers they’re encountering in their work. We also run an internal journal club as advances in machine learning are happening at such a rapid pace. Weekly someone presents a research paper that they’ve found interesting over the week.
I meet with Mozilla external partners. Partners may want to help pool data resources for Common Voice, or talk about the internationalization effort that’s going on now in this next phase of the project. There are also internal groups at Mozilla that are using our software and I may meet with them to see what we can change or improve, or to understand new machine learning technologies that we can supply.
As to how I found my way into machine learning and how Deep Speech came to be, it’s a story in three parts.
Part one, late nineties Washington DC. I was working at a startup during the first internet boom, and a group of my friends, who were all in some way involved in this first boom, began to see, dimly, what’s largely now come to pass. Computers would talk to us, listen to us, and understand us. However, at the time the technology was only giving hints as to what has now come to pass. So, instead of expressing this prognostication through founding a startup, we founded an art collective, Sentient Art. As part of our art installation work, we learned of and applied various neural network based machine learning techniques to train our pieces to act and react to gallery visitors.
Part two, 2011 Berlin, Germany. The early promise of the neural network techniques we worked on in the late nineties never seemed to bear fruit. They worked, but only just. However, in the interim, 2006 to be precise, a Deep Learning revolution had begun. By 2011, the revolution was in full swing. That year, a friend from DFKI (German Research Center for Artificial Intelligence) and I began work on a startup, the goal of which was to create an agent capable of understanding and answering general knowledge queries using the web. In creating this agent we had to dive deeply into many machine learning techniques.
Part three, 2015 Berlin, Germany. At the suggestion of an old college friend from MIT, I joined Mozilla to work on a virtual assistant for their smartphone operating system, Firefox OS. While Firefox OS didn’t blossom as expected, my work there exposed several gaping holes in the open source ecosystem. The most egregious was the lack of a production quality speech recognition engine and speech corpora to train it with. So, we got to work correcting this situation, creating a speech recognition engine, Deep Speech, and Common Voice, a crowdsourced means of creating speech corpora. Since then we have expanded, working on further projects such as text-to-speech and automatic summarization.
With the rise of speech as a primary interface element, one of Mozilla's main challenges is putting speech-to-text and text-to-speech into Firefox and voice assistants, and opening these technologies up for broader innovation. Speech has gone from being a “nice-to-have” browser feature to being “table stakes”. It is all but required.
The recent advances in deep learning based speech-to-text and text-to-speech engines have made coding such engines far more straightforward than was the case only five years ago. This has opened up many possibilities for Mozilla; however, there are still challenges ahead.
One of these is obtaining data sets to train our code on. Firefox is localized into about 100 languages. Obtaining enough training data for English speech-to-text is hard. But doing so in 99 other languages is almost impossible.
However, one great help in all of this is Common Voice and the Open Innovation Team at Mozilla. This team leverages the crowd and open practices to amplify our work. In particular, they lead the Common Voice project, which crowdsources the collection of open speech corpora, bringing the dream of supporting all of Firefox’s languages with speech one step closer to reality. Without the support of the Common Voice community, we are nothing.
We are using machine learning to ensure that the Internet is an integral part of people's lives, regardless of their accent, the language they speak, whether they are blind, or whether they are unable to read.
Speech-to-text and text-to-speech engines are a great help in this regard. However, most current engines are hobbled for at least two reasons. First, they are pay-walled. To develop a novel system that uses speech, one usually must pay a per-sentence fee. Fair enough, but often this fee is prohibitive to new use cases or business models. Second, accents and languages not deemed profitable are excluded from these pay-walled systems. Fair enough, but this leaves many people out in the cold.
We've created Deep Speech and Common Voice, in part, to deal with just such problems. Deep Speech provides an open speech recognition engine that doesn't hide behind any pay-walls and which anyone can use for any accent or language. Common Voice provides an open means through which training data for Deep Speech, or any engine, can be collected. This empowers anyone to create a training set for any accent or language they desire.
Beyond being open, both of these projects are collaborative to the core. Deep Speech is open source. Anyone can contribute to the code base. Just as a rising tide lifts all boats, so it is with Deep Speech: several companies are starting to use it and contribute back, making the code better, which entices more people to use the code and contribute back in turn. A virtuous cycle.
Common Voice has collaboration built into its DNA. All the data collected through Common Voice is crowdsourced and is given back to the world as open data. In addition, Common Voice serves as a hub, bringing together an expert community to agree on the characteristics of the data we're collecting, which further improves the data, and thus attracts further experts and contributors.
Personally, I am, and have been for some time, most excited about developments which further the ability of computers to converse with us as a human would. I think this lies at the core of much of my motivation in working on machine learning.
The main threads I see are: decreased need for translators, taxi drivers, and trailer truck drivers. This final extrapolation is particularly troubling when coupled with the data of, say, the NPR Planet Money story “The Most Common Job In Every State”. In 2014, truck driver was the most common job in over half of the US states. So, a very real and pressing question that has to be answered sooner rather than later is: What happens to these displaced occupations? Do they evolve? Do they disappear? Does guaranteed minimum income come into play?
I don't think anyone has the answer and I suspect there is no single answer, but I think the question bears careful consideration.
Common Voice provides speech corpora that are in many accents and languages, helping to de-bias speech technologies. In addition, it provides the tools for anyone to create such corpora if they have the desire to.
One of the main goals of Deep Speech is to provide production quality speech-to-text in an offline environment, enhancing security and privacy by not requiring users to stream their speech to remote servers.
Beyond my team's work, another way Mozilla is dealing with such problems is by supporting others, outside of Mozilla, who are tackling these problems head-on. To that end, through our Mozilla Research Grant, we are endeavoring to support work that takes dataset bias into account when designing machine learning algorithms, and we are also looking to support security and privacy work for speech recognition.
One thing we are particularly excited about is the imminent release of localized versions of Common Voice. The website is currently being localized to about 44 different languages and some, or all of these, will be ready for contributions in the coming weeks!
Join Kelly in Boston next Thursday and Friday 24 - 25 May to learn more.