Remember the late-night talk show bit where they'd show a picture of a political figure with another person's mouth superimposed over it to make them say dubious things? It always looked a little rough, but that was part of the joke. Well, a new AI tool also takes still images of human subjects and animates their mouth and head movements, but this time the effect is surprisingly, almost worryingly, convincing.
The tool is called EMO: Emote Portrait Alive, and it was developed by researchers at the Institute for Intelligent Computing, part of the Alibaba Group. EMO takes a single reference image and a vocal audio clip and combines them through a diffusion process: the facial region is mixed with noise and then progressively denoised, with the audio conditioning the generated frames, until the result is a video of the subject that not only lip-syncs but also shows varied facial expressions and head poses.
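The pipeline described above — start from a reference portrait, add noise, then denoise each frame under audio conditioning — can be illustrated with a toy sketch. This is not Alibaba's actual EMO code; the function names (`denoise_step`, `generate_frames`) and the blending arithmetic are purely illustrative assumptions about how an audio-conditioned diffusion loop is shaped.

```python
# Toy sketch of an audio-conditioned diffusion loop -- NOT the real EMO
# implementation. All names and numbers here are illustrative.
import numpy as np

def denoise_step(frame, audio_feature, step, total_steps):
    """One toy denoising step: nudge the noisy frame toward a target
    value that depends on the audio conditioning signal."""
    target = 0.5 * (frame.mean() + audio_feature.mean())
    weight = (step + 1) / total_steps  # anneal influence over the schedule
    return frame * (1 - weight * 0.1) + target * weight * 0.1

def generate_frames(reference_image, audio_features, total_steps=20, seed=0):
    """For each chunk of audio features, start from the reference image
    plus noise and iteratively denoise, yielding one video frame per chunk."""
    rng = np.random.default_rng(seed)
    frames = []
    for audio_feature in audio_features:
        frame = reference_image + rng.normal(size=reference_image.shape)
        for step in range(total_steps):
            frame = denoise_step(frame, audio_feature, step, total_steps)
        frames.append(frame)
    return frames

# Usage: a 4x4 "portrait" and three chunks of audio features.
ref = np.zeros((4, 4))
audio = [np.full(8, 0.2), np.full(8, -0.3), np.full(8, 0.7)]
video = generate_frames(ref, audio)
print(len(video), video[0].shape)
```

The point of the sketch is only the control flow: one denoising trajectory per stretch of audio, with the audio steering each step, which is why the output frames track the soundtrack rather than just replaying the still image.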
The technology is demonstrated using sample images of various characters, ranging from real celebrities to AI-generated humans to the Mona Lisa, while the vocal audio includes a Dua Lipa track, pre-recorded interview clips and Shakespeare monologues. Once the process is applied, the generated avatar appears to come to life, moving its lips and face in time with the chosen audio.
The effect is surprisingly accurate, though it has to be said it is far from perfect. “Boo” sounds sometimes seem to come from open mouths rather than closed lips, and every now and then a syllable emerges from clenched teeth, as if the avatar were resisting the AI's insistence on bringing it to life to sing and perform for the internet.
"This is stunning. This AI can sing, speak and rap expressively on individual images from any audio file! 🤯 Introducing EMO: Emote Portrait Alive from Alibaba. 10 wild examples: 🧵👇 1. AI Lady from Sora sings Dua Lipa" (tweet, February 28, 2024: pic.twitter.com/CWFJF9vy1M)
Still, it is a remarkable effect, and those flaws are likely to pass unnoticed by a casual observer unless they are specifically told to pay attention to mouth movements and timing.
Even more impressive is a later demonstration of what the company calls “cross-actor performance.” One clip shows Joaquin Phoenix in full makeup as the Joker, but voiced with Heath Ledger's take on the character from The Dark Knight, including a reasonable approximation of Ledger's trademark lip-licking and smacking in the role.
While the technology is undoubtedly impressive, it is likely to do little to dispel the creeping sense that AI deepfake content, and all the nefarious purposes it can potentially be put to, is advancing at a remarkable pace.
Although these videos are excellent technical demonstrations, they are a reminder that as image and video generation technology matures, it is becoming increasingly difficult to tell what is real from what is computer-generated. AI tools can demonstrate a frightening ability to produce generated content at incredible speed and with increasing sophistication, and that has some worrying implications. Although maybe that's just because I'm a big ol' worrier.
How long, I wonder, before our vacation photos can be pulled from our long-defunct Facebook pages and turned by AI tools into videos of us singing songs we never sang? At least, that will be my excuse.
No, that isn't me drunkenly attempting karaoke in Cyprus. It's an AI-powered fake, I promise.