A series of new California laws will soon take effect addressing the use of digitally altered AI “deepfakes” for political or campaign purposes surrounding an election. These come in response to a deepfake video that mimicked the voice and likeness of current Vice President and presidential candidate Kamala Harris and was shared on X (Twitter) back in July. While the possibilities and potential dangers of deepfake technology are still being debated, as well as how (if at all) it should be regulated, the technology itself is quite a fascinating study. So, debates aside, let’s look this month at how deepfakes actually work from a technical standpoint.
What is a deepfake?
The word “deepfake” is shorthand for “deep learning fake”. Deepfakes are images, audio recordings, or videos created or manipulated with machine learning to depict a person saying or doing something they never actually said or did. Celebrities and political figures are the most common targets, but deepfakes can be created of literally anyone – living, deceased, or non-existent.
How are they created?
The creation of a deepfake can take anywhere from seconds to hours depending on the complexity of what’s being faked and how realistic it needs to be, and the technology is rapidly improving and becoming more accessible every day. Anyone with some experience in photo or video editing software can drop a person or an object into a scene and make it appear they were there, but that’s not considered a deepfake. The key to creating a deepfake is a deep-learning artificial neural network called a variational autoencoder. An autoencoder is a machine-learning tool that encodes and compresses large amounts of input data down to a lower-dimensional “latent space”, and then reconstructs and outputs a decoded version of that latent data. “Which means what?” I hear you ask.
Imagine you want to teach a machine to come up with new and exciting pizza recipes. Using a variational autoencoder, you could hypothetically feed it (pun intended) every pizza recipe on the internet as input so that the machine can learn from that data what it is that makes a pizza … a pizza. The autoencoder compresses all that input data down to the most basic elements (e.g. pizzas are round, pizzas are topped with cheese, pizzas can include toppings) so that the machine can understand how to construct or reconstruct a pizza recipe on its own.
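To make that a bit more concrete, here’s a minimal sketch of an autoencoder in Python using the PyTorch library. Everything here is illustrative – the layer sizes, the 784-value input (think of a small flattened image), and the 32-number latent space are made up for the example, and a true variational autoencoder would additionally learn a probability distribution over that latent space, which is omitted for brevity.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A toy autoencoder: squeeze the input down to a small latent
    space, then try to rebuild the original from that compressed form."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input down to the latent space
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstruct the input from the latent vector
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        latent = self.encoder(x)      # the "essence" of the input
        return self.decoder(latent)   # the reconstruction

model = Autoencoder()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step: penalize reconstructions that don't match the
# original input, so the network learns which details matter.
x = torch.rand(64, 784)               # stand-in batch of input data
optimizer.zero_grad()
loss = loss_fn(model(x), x)
loss.backward()
optimizer.step()
```

Repeat that last step over the whole dataset enough times and the latent space ends up holding exactly the “most basic elements” described above.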
In the case of video deepfakes, the autoencoder must be fed hours and hours of video footage of the person to be faked – from multiple angles, in different lighting, exhibiting various emotions – so that the machine can detect facial features and learn what makes that person recognizable. Once that baseline is established, the machine can then map those features onto another person or model with similar features (a “face swap”). Audio deepfakes are built the same way, condensing hours of recorded speech down to key vocal inflections and tones so that the voice can be reconstructed by the machine to make the person “say” anything you want.
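As a hypothetical sketch of that face-swap idea: the classic approach trains one shared encoder on footage of both people (so it learns generic facial structure, pose, and expression) and a separate decoder for each person. Swapping is then just encoding one face and decoding it with the other person’s decoder. The names and dimensions below are invented for illustration, building on the autoencoder sketch above.

```python
import torch
import torch.nn as nn

# One shared encoder learns generic facial structure (pose,
# expression, lighting) from footage of BOTH people.
shared_encoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 32),
)

def make_decoder():
    # Each person gets their own decoder, trained only on
    # reconstructing that one person's face.
    return nn.Sequential(
        nn.Linear(32, 256), nn.ReLU(),
        nn.Linear(256, 784), nn.Sigmoid(),
    )

decoder_person_a = make_decoder()   # trained on person A's footage
decoder_person_b = make_decoder()   # trained on person B's footage

def face_swap(face_of_a):
    """Encode person A's face, then decode it with person B's
    decoder: the latent vector carries the pose and expression,
    while the decoder supplies person B's identity."""
    latent = shared_encoder(face_of_a)
    return decoder_person_b(latent)

swapped = face_swap(torch.rand(1, 784))   # B's face, A's expression
```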
Deepfakes are also relying more and more on Generative Adversarial Networks (GANs) to up the realism factor. A GAN is another deep learning setup that pits two networks against each other: a “generator” that produces fake output and a “discriminator” that compares that output against real samples (e.g. a sound bite of the person being faked) and tries to tell which is which. If the generated output doesn’t pass the discriminator’s realism test, the generator uses that feedback to adjust its output until it does. It goes on like this, each network constantly trying to outsmart the other, until the output is nearly indistinguishable from reality.
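Here’s a heavily simplified sketch of that feedback loop, again in PyTorch. Real deepfake GANs work on images or audio with deep convolutional networks; this toy version uses small random vectors purely to show the back-and-forth between the generator and the discriminator.

```python
import torch
import torch.nn as nn

# Toy networks; real systems are far deeper and convolutional.
generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                          nn.Linear(64, 784))
discriminator = nn.Sequential(nn.Linear(784, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_samples = torch.rand(32, 784)    # stand-in for real footage/audio

for step in range(1000):
    # 1. Train the discriminator to tell real (label 1) from fake (0)
    fake = generator(torch.randn(32, 16)).detach()
    d_loss = (loss_fn(discriminator(real_samples), torch.ones(32, 1)) +
              loss_fn(discriminator(fake), torch.zeros(32, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2. Train the generator to make the discriminator call its
    #    fakes "real" - i.e. to fool the updated test
    fake = generator(torch.randn(32, 16))
    g_loss = loss_fn(discriminator(fake), torch.ones(32, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

Each pass through the loop makes the discriminator a tougher judge and the generator a better forger, which is exactly the arms race described above.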
How are they used?
The most complex deepfakes are created solely for research purposes and are used to study and improve machine learning. Deepfakes have also been used in the entertainment industry to de-age film actors playing younger versions of themselves, or to replace actors who died before the completion of a project. But as we discussed a few months ago, we must also acknowledge the high potential for misuse when it comes to AI, as deepfakes are commonly used for revenge porn, fraud schemes, cyber-bullying, and political manipulation. Technology itself is benign, but the intent behind it will always determine whether it’s viewed as “good” or “bad”.
Steve Shannon has spent his entire professional career working in tech. He is the IT Director and Lead Developer at PromoCorner, which he joined in 2018. He is, at various times, a programmer, a game designer, a digital artist, and a musician. His monthly blog "Bits & Bytes" explores the ever-evolving realm of technology as it applies to both the promotional products industry and the world at large. You can contact him with questions at steve@getmooresolutions.com.