DeepMind Improves Speech Output Through Artificial Intelligence
The folks at DeepMind have blown the top off speech output with a new capability called WaveNet. Already famous for using artificial intelligence (AI) to beat a professional Go player, the Alphabet subsidiary disclosed Sept. 8 how far it has come along the path toward nearly flawless text-to-speech conversion. The published paper is here.
Most people, if they ponder it at all, think of this problem as solved. Siri, Echo, Cortana, Nuance, and Google’s existing text-to-speech app all more or less render speech. Mathematically, speech output has been thought of as simpler than speech input, which involves the hard problem of recognition, of understanding. Output is just: have sound; play it.
But most computerized speech still seems wooden, essentially because it’s made up of bits and pieces of sound, phonemes or syllables. These elements are stored in some location and called up when needed. Strung together, they have difficulty expressing any human grace: each carries its own preprogrammed intonation, which sounds jarring when certain elements are juxtaposed.
DeepMind approaches the problem at an atomic level, absorbing tremendous quantities of raw audio data — signal amplitude per time, or the loudness of a sound at each instant — which can be depicted as a wave on a graph. What’s amazing here is that when you look at raw audio (à la Lev Rubin, the voiceprint-reading prisoner in Solzhenitsyn’s In the First Circle), “it’s seemingly impossible to discern even syllable boundaries,” says Dave Baggett, who worked in computational phonology before founding Inky, the slick new secure email client. “You can’t tell the difference easily between bleeps and bloops from a numbers station and somebody speaking English.”
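To make “signal amplitude per time” concrete, here is a minimal sketch that synthesizes a pure tone as a list of raw samples, one amplitude value per instant. The 16 kHz sample rate is an assumption chosen because it is a common rate for speech audio; the function name is ours, not anything from DeepMind.

```python
import math

SAMPLE_RATE = 16_000  # samples per second; a common rate for speech audio

def sine_samples(freq_hz, duration_s, amplitude=0.5):
    """Raw audio is nothing more than one amplitude value per time step."""
    n = int(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n)]

samples = sine_samples(440.0, 0.01)  # 10 ms of a 440 Hz tone
print(len(samples))  # 160 samples: 16,000 per second
```

Stare at those 160 numbers and you see Baggett’s point: the structure that a listener hears is nowhere obvious in the sequence itself.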
And taking this raw signal alone as input, WaveNet’s model can be trained to do more accurate text-to-speech conversion than we’ve ever seen before. Just listen to the audio samples in the WaveNet blog post. And remember, raw audio is what the ear receives. WaveNet’s waves just represent the amount of air pressure hitting an eardrum at a given moment, how hard the sound is pushing the eardrum. That’s the input for human speech learning. So, a very simple training model can “learn” to fluently “pronounce” English or Mandarin.
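WaveNet generates audio autoregressively: each sample is predicted from the samples that came before it. As a toy illustration of that idea only — emphatically not DeepMind’s architecture, whose predictor is a deep convolutional network — here the “model” is just an average over a short history window:

```python
# Toy sketch of autoregressive generation: predict each new sample from
# the ones before it, then feed the prediction back in as input.
# The real WaveNet replaces this crude moving average with a trained
# deep network; only the feed-back-your-own-output loop is shared.

def predict_next(history, k=3):
    """Stand-in 'model': the mean of the last k samples."""
    window = history[-k:]
    return sum(window) / len(window)

def generate(seed, n_steps):
    """Each generated sample becomes part of the input for the next."""
    samples = list(seed)
    for _ in range(n_steps):
        samples.append(predict_next(samples))
    return samples

out = generate([0.0, 0.5, 1.0], 2)
print(out)  # appends 0.5, then the mean of [0.5, 1.0, 0.5]
```

The design point carries over: because generation happens one raw sample at a time, the model can capture fine-grained texture that concatenated phoneme snippets cannot.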
Huh. A simple model run iteratively on a tremendous amount of data with huge processor resources equals fluency. Sounds like a human brain. Oh, maybe that’s why we’re so good at learning to talk. Well, one of the reasons.
The thing is, the ability to train a simple network to pronounce English fluently calls into question — at least up to a certain point in the linguistic “stack” — theories that posit the existence of a universal grammar that babies are born with. WaveNet doesn’t have any built-in language model to start with. It’s just a tabula rasa with a simple training model that throws a ton of computing resources at a huge corpus of raw audio.
In language philosophy, the argument between tabula rasa and universal grammar remains central. If humans start with a bare-bones audio processing mechanism and language is 99% learned, then universal grammar might be fiction. If a general learning mechanism can learn language, then linguistics has a simpler explanation for how humans came to speak and understand, an explanation that better accords with an evolutionary explanation than does a big hairy universal grammar mutation. If language is so complicated, how could we have gone from chimps (no language) to humans (full language) with only a small set of mutations?
A simple story of repurposing a mutation that makes use of an existing learning mechanism is much more plausible than the sudden installation of an entire complex grammar. Chimps just don’t have that one mutation and we do. Otherwise, our brains are essentially identical. This rationale fares much better from an Occam’s Razor (law of parsimony) point of view.
Noam Chomsky posited the existence of a magic genetic endowment present only in humans, built into our brains from birth, that allows us to learn human language. But if you don’t need a universal grammar to learn language, then most of what he wrote about linguistics is dead wrong. Universal grammar explains learnability at the expense of evolutionary plausibility, which seems like the wrong trade-off. And it’s quite possible that phonology is just the tip of the iceberg. Syntax, which sits higher up the stack and is where most of the complexity of the universal grammar theory lies, may or may not be amenable to this kind of learning — but it’s possible that it is.
If WaveNet — which looks more like a glimpse of uncharted territory than a solitary result — can deliver learnability without a universal grammar, then we may not need all the old linguistic baggage to explain human language learning.