Microsoft Research Asia has introduced VASA, a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip. The initial model, VASA-1, not only produces lip movements synchronized with the audio but also captures a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness.
It’s a compelling effect and an impressive first step. Of course, as technology like this evolves, it carries some dark implications, so, as always, we need to have frank discussions about the potential uses, and abuses, of the inevitable.
Here are just two examples:
Again, quite impressive considering the starting point was a single AI-generated photo. You can see the rest of the videos on the research page. There’s also an arXiv listing and a PDF of the paper.