Google researchers have found a way to create video versions of people from just a single still image. This makes it possible, for example, to create a video of someone speaking from typed text, or to change a person's mouth movements to match an audio track in a language other than the one originally spoken. It also feels like a slippery slope towards identity theft and misinformation, but what is AI these days without a touch of frightening consequences?
The technology itself is pretty interesting: it's called VLOGGER, and the Google researchers behind it have published a paper. In it, the authors (Enric Corona et al.) offer various examples of how the AI takes a single input image of a human – in these cases, I believe, mostly AI-generated humans – and uses an audio file to generate facial and body movements that match it.
This is just one of a few possible use cases for the technology. Another option is editing videos, especially the facial expressions of a video's subject. In one example, the researchers show different versions of the same clip: one with a presenter speaking into the camera, another with the presenter's mouth eerily held shut, another with his eyes closed. My favorite is the version in which the AI artificially holds the presenter's eyes open, never blinking. Huge serial killer vibes. Thanks, AI.
The most useful feature, in my opinion, is the ability to swap a video's audio track for a dubbed foreign-language version and have the AI lip-sync the person's facial movements to the new audio.
It works in two stages: “1) a stochastic human-to-3D motion diffusion model and 2) a novel diffusion-based architecture that extends text-to-image models with both temporal and spatial controls. This approach enables the generation of high-quality videos of variable length, easily controlled by high-quality representations of human faces and bodies,” the GitHub page says.
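To make that two-stage split a little more concrete, here is a rough, hypothetical sketch of the data flow it describes: audio goes in, per-frame motion comes out of stage one, and stage two renders video frames from that motion plus the single reference photo. The function names, dimensions, and stub implementations below are my own illustrative assumptions, not the authors' code or anything from the VLOGGER repository.

```python
import numpy as np

# Illustrative stand-in for the described pipeline; not Google's actual model.

def audio_to_motion(audio: np.ndarray, num_frames: int, rng: np.random.Generator) -> np.ndarray:
    """Stage 1 (stand-in): stochastic audio-to-3D-motion step.
    A real system would run a diffusion model conditioned on audio features;
    this stub just produces smooth random trajectories (128 values per frame)."""
    motion = rng.standard_normal((num_frames, 128)).cumsum(axis=0)
    return motion / np.abs(motion).max()

def motion_to_video(reference_image: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stage 2 (stand-in): temporally controlled image generation.
    A real model would synthesize the person according to each motion vector;
    here we only perturb the reference photo so the data flow is visible."""
    frames = []
    for m in motion:
        frame = np.clip(reference_image + 0.01 * m[:3].mean(), 0.0, 1.0)
        frames.append(frame)
    return np.stack(frames)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    photo = rng.random((256, 256, 3))        # the single input image
    audio = rng.standard_normal(16000 * 4)   # 4 s of driving audio at 16 kHz (assumed rate)
    motion = audio_to_motion(audio, num_frames=4 * 25, rng=rng)  # assume 25 fps output
    video = motion_to_video(photo, motion)
    print(video.shape)  # (100, 256, 256, 3): frames, height, width, channels
```

The point of the sketch is only the shape of the pipeline: one stochastic model turns sound into motion, and a second, temporally aware generator turns that motion and a single photo into a video of arbitrary length.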
Generation of Moving and Talking People: "Here is an example of generating talking faces with just a single input image and a driving sound." pic.twitter.com/hd7HKDfYkP (March 18, 2024)
Admittedly, the technology is not perfect. In the examples provided, the mouth movements exhibit the telltale quirks common to AI-generated video. It's also pretty scary at times, as noted by users who responded to a thread about the technology from EyeingAI on X. But VLOGGER doesn't have to fool everyone, or anyone at all, to be useful. If the technology were more polished, it would be even more worrying to think about how it could be used to create deepfakes, spread misinformation, or steal identities. One day we'll get there, and I for one hope that by then we'll have a bit more of a handle on how to deal with this stuff.