|
|
|
|
|
Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image. Previous methods have relied on domain-specific heuristics such as warping-based motion representation and 3D Morphable Models, which limit the naturalness and diversity of the generated avatars. In this work, we introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation.
In light of the observation that speech drives only the motion of the avatar, while the appearance of the avatar and the background typically remain unchanged throughout the entire video, we divide our approach into two stages: 1) disentangling each frame into motion and appearance representations; 2) generating motion sequences conditioned on the speech and a reference portrait image.
We collect a large-scale, high-quality talking avatar dataset and train models of different scales on it (up to 2B parameters). Experimental results verify the superiority, scalability, and flexibility of GAIA: 1) the resulting model surpasses previous baselines in terms of naturalness, diversity, lip-sync quality, and visual quality; 2) the framework is scalable, since larger models yield better results; 3) it is general and enables applications such as controllable talking avatar generation and text-instructed avatar generation.
GAIA consists of a Variational AutoEncoder (VAE, the orange modules) and a diffusion model (the blue and green modules). The VAE is first trained to encode each video frame into a disentangled representation (i.e., a motion representation and an appearance representation) and to reconstruct the original frame from it. The diffusion model is then optimized to generate motion sequences conditioned on the speech sequence and a random frame from the same video clip. During inference, the diffusion model takes an input speech sequence and the reference portrait image as conditions and produces a motion sequence, which is decoded into a video with the VAE decoder.
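To make the pipeline concrete, the following minimal PyTorch sketch mirrors the two-stage design described above. The module names, feature dimensions, and the simplified refinement loop standing in for the diffusion sampler are illustrative assumptions, not the actual GAIA implementation.

```python
# Hedged sketch of the GAIA-style two-stage pipeline; all sizes/names are assumed.
import torch
import torch.nn as nn

MOTION_DIM, APPEAR_DIM, SPEECH_DIM, FRAMES = 64, 256, 80, 100  # assumed sizes

class FrameVAE(nn.Module):
    """Stage 1: disentangle a frame into motion + appearance and reconstruct it."""
    def __init__(self):
        super().__init__()
        self.motion_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, MOTION_DIM))
        self.appear_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, APPEAR_DIM))
        self.decoder = nn.Linear(MOTION_DIM + APPEAR_DIM, 3 * 64 * 64)

    def encode(self, frame):
        return self.motion_enc(frame), self.appear_enc(frame)

    def decode(self, motion, appearance):
        return self.decoder(torch.cat([motion, appearance], dim=-1)).view(-1, 3, 64, 64)

class MotionDenoiser(nn.Module):
    """Stage 2: predict a clean motion sequence from a noisy one, conditioned on
    per-frame speech features and the motion of the reference frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * MOTION_DIM + SPEECH_DIM + 1, 512),
            nn.ReLU(),
            nn.Linear(512, MOTION_DIM),
        )

    def forward(self, noisy_motion, speech, ref_motion, t):
        t_emb = t.view(1, 1, 1).expand(noisy_motion.shape[0], noisy_motion.shape[1], 1)
        ref = ref_motion.unsqueeze(1).expand(-1, noisy_motion.shape[1], -1)
        return self.net(torch.cat([noisy_motion, speech, ref, t_emb], dim=-1))

@torch.no_grad()
def generate(vae, denoiser, speech, portrait, steps=50):
    """Sample a motion sequence conditioned on speech and the reference portrait,
    then decode every frame with the (frozen) VAE decoder."""
    ref_motion, appearance = vae.encode(portrait)
    motion = torch.randn(speech.shape[0], speech.shape[1], MOTION_DIM)
    for step in reversed(range(steps)):                # simplistic refinement loop,
        t = torch.tensor(step / steps)                 # not a full DDPM sampler
        motion = denoiser(motion, speech, ref_motion, t)
    frames = [vae.decode(motion[:, i], appearance) for i in range(motion.shape[1])]
    return torch.stack(frames, dim=1)                  # (batch, frames, 3, 64, 64)

if __name__ == "__main__":
    vae, denoiser = FrameVAE(), MotionDenoiser()
    speech = torch.randn(1, FRAMES, SPEECH_DIM)        # per-frame speech features
    portrait = torch.randn(1, 3, 64, 64)               # the single reference image
    print(generate(vae, denoiser, speech, portrait).shape)  # [1, 100, 3, 64, 64]
```

In line with the description above, the two components would be trained separately: the VAE first, and then the motion diffusion model on top of the frozen VAE's motion representations.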
In addition to predicting the head pose from the speech, the model also supports pose-controllable generation. We implement this by replacing the estimated head pose with either a handcrafted pose or one extracted from another video. The results show that we can control the head pose while the lip motion remains synchronized with the speech content.
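Assuming the head pose occupies a known slice of the motion representation (an assumption made here for illustration; the slice indices and dimensions below are hypothetical), the pose override can be sketched as:

```python
# Toy illustration of pose-controllable generation; indices/shapes are assumed.
import torch

POSE_SLICE = slice(0, 6)  # assumed: e.g. yaw/pitch/roll + translation channels

def override_pose(generated_motion: torch.Tensor,
                  control_pose: torch.Tensor) -> torch.Tensor:
    """Replace the speech-predicted head pose with a handcrafted pose or a pose
    sequence extracted from another video, leaving the lip motion untouched."""
    motion = generated_motion.clone()
    motion[..., POSE_SLICE] = control_pose
    return motion

# Example: keep the head fixed at a single pose for every frame.
motion = torch.randn(1, 100, 64)                     # (batch, frames, motion dim)
fixed_pose = torch.zeros(1, 1, 6).expand(1, 100, 6)  # broadcast over frames
controlled = override_pose(motion, fixed_pose)
```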
Due to the controllability of the inverse diffusion process, we can control arbitrary facial attributes by editing the landmarks during generation. Specifically, we train a diffusion model to synthesize the coordinates of the facial landmarks. Fully controllable talking avatar generation is then enabled by generating the mouth-related landmarks from the speech while keeping the remaining landmarks fixed to the reference motion.
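A sketch of this composition is given below; the 68-point landmark layout, the mouth index range, and the coordinate shapes are illustrative choices rather than the released implementation.

```python
# Hedged sketch: mouth landmarks follow the speech-driven landmark model,
# all other landmarks are held at the reference motion.
import torch

N_LANDMARKS = 68
MOUTH_IDX = torch.arange(48, 68)   # assumed mouth indices in a 68-point layout

def compose_landmarks(speech_driven: torch.Tensor,
                      reference: torch.Tensor) -> torch.Tensor:
    """speech_driven, reference: (frames, N_LANDMARKS, 2) landmark coordinates."""
    mask = torch.zeros(N_LANDMARKS, 1, dtype=torch.bool)
    mask[MOUTH_IDX] = True
    # Mouth landmarks come from the speech-driven model; the rest stay fixed.
    return torch.where(mask, speech_driven, reference)

frames = 100
speech_driven = torch.randn(frames, N_LANDMARKS, 2)  # e.g. landmark diffusion output
reference = torch.randn(1, N_LANDMARKS, 2).expand(frames, -1, -1)
edited = compose_landmarks(speech_driven, reference)
```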
To show the generality of our framework, we consider textual instructions as the condition for avatar generation. Specifically, given a single reference portrait image, the generation should follow textual instructions such as “please smile” or “turn your head left” to produce a video clip in which the avatar performs the desired action.
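As an illustration of how the conditioning signal can be swapped, the toy encoder below embeds a tokenized instruction into a single conditioning vector; the encoder, vocabulary, and interface are hypothetical placeholders, not the actual text encoder used by GAIA.

```python
# Minimal sketch of conditioning the motion generator on a text instruction.
import torch
import torch.nn as nn

class InstructionEncoder(nn.Module):
    """Toy text encoder: embeds token ids and mean-pools them into one vector."""
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids).mean(dim=1)   # (batch, dim)

encoder = InstructionEncoder()
tokens = torch.randint(0, 1000, (1, 8))            # e.g. tokenized "please smile"
instruction_emb = encoder(tokens)                  # (1, 128) conditioning vector
# This embedding would then condition the motion diffusion model in place of
# (or alongside) the per-frame speech features, e.g. broadcast over the frame axis.
```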
GAIA is intended for advancing AI/ML research on talking avatar generation. We encourage users to use the model responsibly, and we discourage using the code to generate intentionally deceptive or untrue content or to engage in inauthentic activities. To help prevent misuse, adding watermarks is a common approach that has been widely studied in both research and industry. In addition, as a generative AI model, the outputs of our model can be used to construct synthetic datasets for training discriminative models.