“What’s the weather, Alexa?” And the answer comes in a voice few in India wouldn’t recognise. Alexa has got an Amitabh Bachchan baritone. So you have the “Angry Young Man” (now in his 70s) taking orders or telling you whether it’s going to be hot and muggy today.
The technology behind the voice is stunning. Every word spoken by “Alexa Bachchan” is computer-generated, not a recording. It takes linguistics, audio technology and, above all, artificial intelligence for Alexa’s voice-activated service to “understand” natural language and respond with apparent intelligence.
The difference between computers and humans becomes most apparent when we consider the contrasting things the two find easy or difficult. A computer can crunch thousands of long numbers quickly. But it struggles to make sense of a simple sentence like “Lock the door”.
Speech patterns are as unique as handwriting. We compress speech in different ways (even long-winded speakers like Shashi Tharoor do). And spoken language is rarely grammatically correct. We assume context. When a cricket commentator says “Bumrah is wicked”, he’s referring to Jasprit Bumrah’s bowling skills, not his character. When you say, “Switch on the pump at six”, you don’t specify whether you mean am or pm. We also stop mid-sentence, hesitate and repeat ourselves.
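The “pump at six” problem can be made concrete with a toy sketch. This is purely illustrative, not how any real assistant is implemented: the program expands an hour with no am/pm marker into both possible readings, then uses context (here, just the current hour) to pick the more plausible one.

```python
# Toy sketch (not Alexa's actual logic): "at six" is ambiguous, so the
# assistant must expand one phrase into every reading, then use context
# to pick the most plausible one.

def candidate_times(phrase):
    """Expand an hour with no am/pm marker into both readings."""
    words_to_hour = {"six": 6, "seven": 7, "eight": 8}
    for word, hour in words_to_hour.items():
        if word in phrase.lower():
            return [hour, hour + 12]   # e.g. 6:00 and 18:00
    return []

def resolve(phrase, current_hour):
    """Pick the next upcoming reading, a simple contextual heuristic."""
    options = candidate_times(phrase)
    upcoming = [h for h in options if h > current_hour]
    return min(upcoming) if upcoming else min(options)

print(resolve("Switch on the pump at six", current_hour=9))   # -> 18
print(resolve("Switch on the pump at six", current_hour=3))   # -> 6
```

Spoken at 9 in the morning, “six” is resolved to 6 pm; spoken at 3 am, it resolves to 6 am. A real assistant weighs far richer context, but the shape of the problem is the same.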
In addition to grammatical challenges, languages use constrained alphabets to represent many more sounds. English, for example, has a 26-letter alphabet but around 44 phonemes (distinct sounds). In a famous example often attributed to George Bernard Shaw, “ghoti” could be pronounced “fish”: gh as in “enough”, o as in “women”, ti as in “nation”. Moreover, different people with different accents will pronounce the same phoneme set differently.
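The “ghoti” trick can be shown as a tiny lookup table. The table below is a toy, not a real grapheme-to-phoneme system; it simply records the three unusual readings so the mismatch between letters and sounds is visible in code.

```python
# Toy grapheme-to-phoneme table (illustrative, not a real G2P system):
# the same English spelling can stand for different sounds in different words.
GHOTI_READINGS = {
    "gh": "f",   # as in "enough"
    "o":  "i",   # roughly the short-i sound, as in "women"
    "ti": "sh",  # as in "nation"
}

def respell(word, table):
    """Greedily replace the longest matching letter group at each position."""
    out, i = [], 0
    while i < len(word):
        for size in (2, 1):                 # try two letters, then one
            chunk = word[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append(word[i])             # no rule: keep the letter as-is
            i += 1
    return "".join(out)

print(respell("ghoti", GHOTI_READINGS))  # -> "fish"
```

A real speech system needs pronunciation dictionaries and statistical models precisely because no small table like this can cover English spelling.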
It is impossible for Bachchan, or anyone else, to recite every word in the dictionary. Some degree of modulation is also necessary — much of the fun of Bachchan responding to your commands would disappear if “he” used a flat, lifeless voice like a bad YouTube recording.
Machine learning programs, like those used by Alexa and Siri, don’t record every word. Indeed, the voice can speak languages Bachchan may not know. The voice is recorded saying multiple things, and the noises are “broken up”, reassembled, sliced and diced. Timbre, tone and modulations are analysed using neural nets in the same way a violin or guitar may be analysed and synthesised. There are demonstrations where these technologies have been used to get Hitler and Stalin to perform duets.
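The “sliced and diced” idea can be caricatured in a few lines. This is a drastically simplified sketch of concatenative synthesis in general, an assumption about the technique rather than Amazon’s actual pipeline: recorded phrases are cut into labelled sound units, indexed, and new utterances are stitched together from units the speaker never said in that order.

```python
# Much-simplified sketch of concatenative synthesis (not Amazon's pipeline).
# Pretend recordings: each phrase is already segmented into (label, audio) units.
recordings = [
    [("HH", b"a1"), ("EH", b"a2"), ("L", b"a3"), ("OW", b"a4")],   # "hello"
    [("W", b"b1"), ("ER", b"b2"), ("L", b"b3"), ("D", b"b4")],     # "world"
]

# Index every unit by its sound label so synthesis can look units up quickly.
unit_index = {}
for phrase in recordings:
    for label, audio in phrase:
        unit_index.setdefault(label, audio)

def synthesise(phoneme_sequence):
    """Stitch stored units into a new utterance; None marks a missing sound."""
    return [unit_index.get(p) for p in phoneme_sequence]

# A sequence the speaker never recorded as one phrase:
print(synthesise(["L", "OW", "W", "ER", "L", "D"]))
```

Modern systems go further and use neural networks to generate the waveform directly, which is what lets them reproduce timbre and modulation, and even languages the original speaker never recorded.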
All the voice-operated assistants do this for multiple male and female voices, in different accents. Amazon has done this before with American celebrity voices such as Samuel L. Jackson’s and John Legend’s, though behavioural scientists say people prefer instructions delivered in women’s voices.
Understanding what a user is saying is an even harder process for computers. Voice-operated assistants (even relatively simple voice-to-text dictation programs) need to “train” for specific accents and train further for specific owners. They must use a lot of contextual learning and this takes many hours. The fact that Alexa recognises different voices — one account can have up to ten different users — is part and parcel of this.
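How could one device tell up to ten household voices apart? The sketch below is an illustration of the general idea, not Alexa’s actual voice profiles: each user enrols by speaking a few phrases, the device stores an average feature vector per user, and new speech is matched to the nearest stored profile. The feature values here are made up.

```python
# Toy speaker identification by nearest enrolled profile (illustrative only).

def average(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Enrolment: pretend pitch/timbre features extracted from a few utterances.
profiles = {
    "amit":  average([[110.0, 0.20], [112.0, 0.25], [108.0, 0.22]]),
    "priya": average([[210.0, 0.60], [205.0, 0.55], [215.0, 0.62]]),
}

def identify(features):
    """Return the enrolled user whose profile is closest to this utterance."""
    return min(profiles, key=lambda user: distance(profiles[user], features))

print(identify([111.0, 0.21]))   # -> "amit"
print(identify([208.0, 0.58]))   # -> "priya"
```

Real systems extract hundreds of learned features per utterance, but the enrol-then-match loop is why adding a new user takes a short training session.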
Alexa allows third-party development too. The program can be taught “skills”, and any programmer can learn to integrate their offerings with Alexa. The thrust into India makes obvious business sense: India has roughly 500 million smartphone users and cheap data rates (even if its networks are among the slowest in the world).
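A “skill”, stripped to its essentials, is a small web service: Alexa sends a JSON request naming the intent it recognised, and the skill replies with JSON saying what to speak. The handler below is a minimal sketch; the intent name is invented for illustration, while the envelope shape follows the Alexa Skills Kit JSON interface.

```python
# Minimal sketch of a third-party Alexa "skill" handler. "PumpOnIntent" is
# a hypothetical custom intent; the response envelope follows the Alexa
# Skills Kit JSON interface.
import json

def handle_request(event):
    """Map an incoming intent to a spoken reply."""
    intent = event.get("request", {}).get("intent", {}).get("name", "")
    if intent == "PumpOnIntent":
        speech = "Okay, switching on the pump."
    else:
        speech = "Sorry, I don't know that one yet."
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }

request = {"request": {"type": "IntentRequest",
                       "intent": {"name": "PumpOnIntent"}}}
print(json.dumps(handle_request(request)["response"]["outputSpeech"]["text"]))
```

Everything hard (speech recognition, intent matching, the Bachchan voice itself) happens on Amazon’s side; the third-party developer only maps intents to replies, which is why the skills ecosystem grew so quickly.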