Microsoft Research Asia has introduced VASA, a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip. The initial model, VASA-1, not only produces lip movements synchronized with the audio but also captures a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness.
It’s a compelling effect and an impressive first step. Of course, as technology like this evolves, it carries some dark implications, so, as always, we need to have frank discussions about the potential uses, and abuses, of the inevitable.
Here are just two examples:
Again, quite impressive considering the starting point was a single AI-generated photo. You can see the rest of the videos on the research page. There’s also an arXiv listing and a PDF of the paper.