The Mona Lisa Can Now Talk, Thanks to EMO

March 1, 2024

Researchers at the Institute for Intelligent Computing at Alibaba Group have developed an AI tool known as EMO: Emote Portrait Alive, which brings portraits to life.

The tool generates a video from a single still image and an audio track. Given an old portrait such as Leonardo da Vinci’s La Gioconda, better known as the Mona Lisa, it can make the subject talk and sing with varied head poses, facial expressions, and accurate lip sync.

Expressive audio-driven portrait-video generation tool

In their paper, “EMO: Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions,” the researchers describe the new tool, what it can do, and how it works.

With the audio-driven tool, users can create vocal avatar videos with expressive faces. According to the researchers, it can generate videos of any duration “depending on the length of the input audio.”

“Input a single character image and a vocal audio, such as singing, and our method can generate vocal avatar videos with expressive facial expressions and various head poses,” said the researchers.

“Our method supports songs in various languages and brings diverse portrait styles to life. It intuitively recognizes tonal variations in the audio, enabling the generation of dynamic, expression-rich avatars.”
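The point that video duration follows the length of the input audio can be illustrated with a little arithmetic. The sketch below is only an illustration: the function name `frames_needed` and the default of 30 frames per second are assumptions made here, not values taken from the researchers' report.

```python
import math


def frames_needed(audio_seconds: float, fps: int = 30) -> int:
    """Return how many video frames are needed to cover an audio clip.

    `fps` is an assumed output frame rate, not a figure from the EMO report.
    """
    if audio_seconds < 0 or fps <= 0:
        raise ValueError("audio_seconds must be >= 0 and fps must be > 0")
    # Round before taking the ceiling so tiny floating-point errors
    # do not add a spurious extra frame.
    return math.ceil(round(audio_seconds * fps, 6))


if __name__ == "__main__":
    # A 12.5-second vocal clip at the assumed 30 fps.
    print(frames_needed(12.5))
```

Under these assumptions, a longer song simply means more frames to generate; nothing about the portrait itself limits the length.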

Talking, singing from a portrait

According to the researchers, the AI-powered tool processes not only singing but also spoken audio in different languages.

“Additionally, our method has the capability to animate portraits from bygone eras, paintings, and both 3D models and AI-generated content, infusing them with lifelike motion and realism,” said the researchers.

It does not end there. The tool can also animate portraits and images of movie stars so that they appear to deliver monologues or performances in various styles and languages.

Some AI enthusiasts who took to the X platform described it as “mind-blowing.”

2. Mona Lisa talking Shakespeare pic.twitter.com/26k29aAz1P

— Min Choi (@minchoi) February 28, 2024

Thinning boundary between real and AI

News of Alibaba’s EMO tool has led some users to suggest that the boundary between AI and reality is about to disappear as tech firms keep releasing new products.

“The edge between AI and real is thinner than ever,” posted Ruben on X, while others think TikTok will soon be flooded with the creations.

“This is the first time I have seen such a precise and realistic result. Video AI this year promises to be credible,” said Paul Covert.

While some think this could be a game changer for creatives, Min Choi struck a cautious note.

“Hopefully just for creative things. This could be dangerous in the wrong hands.”

How the tool works

Explaining the process, the researchers said the EMO framework has two stages. The first, Frames Encoding, uses ReferenceNet to extract features from the reference image and motion frames.

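The second is the Diffusion Process stage, where a pretrained audio encoder “processes the audio embedding.” The facial region mask is combined with multi-frame noise to govern the generation of facial imagery, and a Backbone Network then performs the denoising using two attention mechanisms, Reference-Attention and Audio-Attention.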

“These mechanisms are essential for preserving the character’s identity and modulating the character’s movements, respectively,” reads part of the explanation.

“Additionally, Temporal Modules are utilized to manipulate the temporal dimension and adjust the velocity motion.”
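To make the two-stage flow easier to picture, here is a toy data-flow sketch. It is not EMO's code and contains no neural network: plain lists of numbers stand in for images and features, and every name in it (`frames_encoding`, `diffusion_process`, `denoise_step`, the `strength` parameter, the 0.1 motion weight) is a hypothetical stand-in chosen for illustration. It only shows the order of operations the researchers describe: encode the reference image and motion frames first, then start from noise and refine it inside the face mask, guided by the audio.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class EncodedFrames:
    reference: Vector  # stand-in for identity features of the portrait
    motion: Vector     # stand-in for features of earlier frames


def frames_encoding(reference_image: Vector, motion_frames: List[Vector]) -> EncodedFrames:
    """Stage 1 stand-in. A real system would run ReferenceNet here."""
    if motion_frames:
        # Average the motion frames element by element.
        motion = [sum(col) / len(motion_frames) for col in zip(*motion_frames)]
    else:
        motion = [0.0] * len(reference_image)
    return EncodedFrames(reference=list(reference_image), motion=motion)


def denoise_step(noisy: Vector, encoded: EncodedFrames, audio: Vector,
                 face_mask: List[int], strength: float = 0.5) -> Vector:
    """One toy refinement step, applied only where the face mask is 1."""
    out = []
    for x, ref, mot, a, m in zip(noisy, encoded.reference, encoded.motion, audio, face_mask):
        # Toy target: identity plus audio, nudged slightly by past motion.
        target = ref + a + 0.1 * mot
        out.append(x + strength * (target - x) if m else x)
    return out


def diffusion_process(encoded: EncodedFrames, audio_embedding: Vector,
                      face_mask: List[int], noise: Vector, steps: int = 4) -> Vector:
    """Stage 2 stand-in: start from noise and refine it step by step."""
    frame = list(noise)
    for _ in range(steps):
        frame = denoise_step(frame, encoded, audio_embedding, face_mask)
    return frame


if __name__ == "__main__":
    encoded = frames_encoding([1.0, 2.0, 3.0], [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
    frame = diffusion_process(encoded, audio_embedding=[0.5, 0.5, 0.5],
                              face_mask=[1, 1, 0], noise=[9.0, -9.0, 9.0])
    print(frame)
```

In this toy version, positions outside the mask keep their starting noise, while positions inside it move toward a value set by the portrait and the audio. A real diffusion model learns that refinement from data instead of using a fixed rule.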