Microsoft's new VALL-E AI can capture your voice in 3 seconds

upnorth · Jan 11, 2023

Microsoft researchers have presented an impressive new text-to-speech AI model, called Vall-E, which can listen to a voice for just a few seconds, then mimic that voice – including the emotional tone and acoustics – to say whatever you like.

It's the latest of many AI algorithms that can harness a recording of a person's voice and make it say words and sentences that person never spoke – and it's remarkable for just how small a scrap of audio it needs in order to extrapolate an entire human voice. Where 2017's Lyrebird algorithm from the University of Montreal, for example, needed a full minute of speech to analyze, Vall-E needs just a three-second audio snippet.

The AI has been trained on some 60,000 hours of English speech – mainly, it seems, by audiobook narrators – and the researchers have presented a swag of samples in which Vall-E attempts to puppeteer a range of human voices. Some do a pretty extraordinary job of capturing the essence of the voice and building new sentences that sound natural; you'd struggle to tell which was the real voice and which was the synthesis. In others, the only giveaway is when the AI puts the emphasis in strange places in the sentence.

In terms of emotion, the results are less impressive. Using samples of speech marked as angry, sleepy, amused or disgusted seems to send things off the rails, and the synthesis comes out sounding weirdly distorted.

The implications of this sort of tech are pretty clear; on the positive side, at some point you'll be able to have Morgan Freeman narrate your shopping list as you ride a trolley down the supermarket aisle. If an actor dies halfway through a movie, they can finish their performance through deepfaked video and audio using systems like this. Apple has recently introduced a catalog of audiobooks read to you by an AI, and it stands to reason that you'll soon be able to flip between narrators on the fly.

On the negative side, well, it's not great news for voice actors and narrators. Or indeed for listeners; AI might be able to pump out narrations quickly and extremely cheaply, but don't expect much art to it. They won't interpret Douglas Adams like Stephen Fry. The potential for scam artists is also sky-high. If a scammer can get you on the phone for three seconds, they can steal your voice and call your grandma with it. Or bypass any voice-recognition security devices.

newatlas.com
