Nagging doubts about the risks of AI-generated deep-fakes refuse to go away, owing to the murky question of accountability
“We’re entering an era in which our enemies can make it look like anyone is saying anything at any point in time,” Obama begins. “Even if they would never say those things. So, for instance…” he continues, gesturing with his hands, “they could have me say things like President Trump is a total and complete dipshit!” His eyes seem to glimmer with a hint of a smile. Obama continues: “Now, you see, I would never say these things, at least not in a public address.” Obama never did say those things. The video was fake – a so-called ‘deep-fake’, created with the help of Artificial Intelligence (AI). The voice was eerily natural.
A few months ago, millions of TV viewers across South Korea were watching the MBN channel to catch the latest news. At the top of the hour, regular news anchor Kim Joo-Ha started to go through the day’s headlines. It was a relatively normal list of stories for late 2020 – full of Covid-19 and pandemic response updates. Yet this particular bulletin was far from normal, as Kim Joo-Ha wasn’t actually on the screen. Instead, she had been replaced by a “deep-fake” version of herself – a computer-generated copy that aimed to perfectly mirror her voice, gestures, and facial expressions.
Machine-generated voices have often sounded a tad metallic and clunky – in a word, robotic. But impressive strides have been made in human-sounding machine voices. Google created Cloud Text-to-Speech, Microsoft offers Azure Cognitive Services Text to Speech, IBM launched Watson Text-to-Speech, and Baidu provides its own Text-to-Speech service. Microsoft has taken this further than any other company: it has launched Custom Neural Voice (CNV), which uses AI to generate a specific person’s voice.
Last year OpenAI’s Generative Pre-trained Transformer 3 (GPT-3), an autoregressive language model that uses deep learning to produce human-like text, shocked the world with how accurately it mimicked text written by humans. It was the most powerful tool yet for spitting out convincing streams of text in a range of different styles when prompted with an opening passage. It was so alarming that OpenAI decided not to make it widely available. Now, Microsoft’s CNV uses AI to generate natural speech that sounds exactly like a specific person. This is a crucial shift away from a generic voice to a specific person’s voice. Once again it raises the crucial question of Algorithmic Accountability.
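The word “autoregressive” has a simple meaning: the model repeatedly predicts the next token conditioned on everything it has generated so far. A toy character-level sketch makes the loop visible (illustrative only – GPT-3 itself uses a 175-billion-parameter transformer, not bigram counts like these):

```python
import random
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each character, which characters tend to follow it."""
    model = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        model[a][b] += 1
    return model

def generate(model, prompt, length, seed=0):
    """Autoregressive sampling: each new character is drawn
    conditioned on the character generated just before it."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(length):
        choices = model.get(out[-1])
        if not choices:  # no known continuation; stop early
            break
        chars, weights = zip(*choices.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

corpus = "the quick brown fox jumps over the lazy dog and the cat"
model = train_bigram(corpus)
print(generate(model, "th", 20))
```

Scaled up from character bigrams to billions of learned parameters over long contexts, this same predict-then-append loop is what lets GPT-3 continue any opening text in a plausible style.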
The nagging doubts about the risks of this technology to generate deep-fake news refuse to go away, despite Microsoft’s claim that it has considered the implications of CNV and prioritizes responsible use of the technology. It is terrifying to think of the consequences of CNV in the wrong hands as a dangerously potent tool for creating deep-fakes.
There is worldwide concern over false news and the possibility that it can influence political, economic, and social well-being. To understand how false news spreads, an MIT experiment used a data set of rumour cascades on Twitter from 2006 to 2017. About 126,000 rumours were spread by ∼3 million people. False news reached more people than the truth; the top 1% of false news cascades diffused to between 1000 and 100,000 people, whereas the truth rarely diffused to more than 1000 people. Falsehood also diffused faster than the truth.
This is precisely why the question of Algorithmic Accountability becomes so vital to the ethical use of technology: every day we witness algorithms making appalling errors. In December last year, Stanford Medical Center’s misallocation of Covid-19 vaccines was blamed on a distribution “algorithm” that favoured high-ranking administrators over frontline doctors. The hospital claimed to have consulted with ethicists to design its “very complex algorithm,” which a representative said “clearly didn’t work right.”
Recent developments have enabled speech generation to sound more natural, and companies often offer a range of voices and dialects that customers can select to customize interactions. For example, Google’s Cloud Text-to-Speech offers more than 180 voices across over 30 languages and variants. However, the ability to imitate an individual’s speech (each person has a unique prosody, which is the tone and duration of phonemes, or units of sound) takes natural generation to another level. It also underscores the urgency for discussions related to Responsible AI.
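What makes a voice individual is captured in exactly those prosodic features – how long each phoneme lasts and at what pitch it is spoken. A hypothetical sketch of the kind of per-speaker profile a voice-cloning system might extract from an aligned recording (the phoneme alignment and pitch values below are invented for illustration; real systems derive them from forced alignment and pitch tracking):

```python
from statistics import mean

# Hypothetical alignment for one utterance ("hello"):
# (phoneme, start_seconds, end_seconds, mean_pitch_hz) -- invented values.
alignment = [
    ("HH", 0.00, 0.08, 118.0),
    ("EH", 0.08, 0.21, 131.5),
    ("L",  0.21, 0.29, 125.0),
    ("OW", 0.29, 0.52, 112.3),
]

def prosody_profile(alignment):
    """Summarize a speaker's prosody as per-phoneme duration and pitch."""
    profile = {}
    for phoneme, start, end, pitch in alignment:
        profile[phoneme] = {
            "duration_s": round(end - start, 3),
            "pitch_hz": pitch,
        }
    return profile

profile = prosody_profile(alignment)
print(profile["OW"])  # the longest, lowest phoneme in this utterance
average_pitch = mean(p["pitch_hz"] for p in profile.values())
```

Gather enough such utterances from one person and the statistics of these durations and pitches become a fingerprint of that person’s voice – which is precisely what a system like CNV learns to reproduce.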
Imagine that a computer, with enough sample data, could be trained to sound like any individual and to say anything. Similarly, deep-fake videos, which use machine learning to generate visual content, can be made to depict individuals doing or saying just about anything, with ever-improving quality. It doesn’t take a highly creative individual to see how this could be used for fraudulent or malicious purposes.