Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Visual Computing

Neural Face Modelling and Animation

We present a practical framework for the automatic creation of animatable human face models from calibrated multi-view data. Using deep neural networks, we combine classical computer graphics models with image-based animation techniques. Based on captured multi-view video footage, we learn a compact latent representation of facial expressions by training a variational auto-encoder on textured mesh sequences.


Schematic overview of the proposed framework


We capture face geometry with a simple linear model that represents rigid motion as well as large-scale deformation of the face. Fine details as well as the appearance of complex face areas (e.g. mouth, eyes) are mainly captured in texture-space.
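As a rough illustration of such a geometry model, the following sketch evaluates a linear deformation basis followed by a rigid transform. It is a minimal sketch under stated assumptions, not the implementation used in this work: the blendshape-style basis, the single-axis rotation helper, and all function and parameter names (`deform`, `rigid_transform`, `face_geometry`, `rotation_z`) are hypothetical.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation about the z-axis (hypothetical stand-in for a full 3D rotation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def deform(mean_shape, basis, weights):
    """Large-scale deformation: mean_shape + sum_k weights[k] * basis[k], per vertex.

    mean_shape: list of [x, y, z] vertices
    basis:      list of K displacement fields, each shaped like mean_shape (assumed)
    weights:    list of K scalar coefficients
    """
    deformed = []
    for i, vertex in enumerate(mean_shape):
        deformed.append([
            vertex[d] + sum(w * b[i][d] for w, b in zip(weights, basis))
            for d in range(3)
        ])
    return deformed

def rigid_transform(vertices, rotation, translation):
    """Rigid motion: map every vertex x to R x + t."""
    return [
        [sum(rotation[r][c] * v[c] for c in range(3)) + translation[r] for r in range(3)]
        for v in vertices
    ]

def face_geometry(mean_shape, basis, weights, rotation, translation):
    """Deform first, then place the deformed mesh with the rigid head pose."""
    return rigid_transform(deform(mean_shape, basis, weights), rotation, translation)
```

Because the mesh topology stays fixed, only the weights and the rigid pose change from frame to frame; everything the linear basis cannot express is left to the texture.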


Results of our facial performance capture process. Each row shows a different facial
expression. The left and middle column show the textured face model (left: wireframe,
middle: with directional light). The right column shows the original video frames.
It is worth noting that complex areas like mouth and eyes are properly captured and
can be reconstructed in high quality. Even though tongue, teeth and gums are captured
only in texture space, the rendered model looks realistic.


Our facial performance capture process outputs textured mesh sequences with constant topology. Based on these textured mesh sequences, we learn a latent representation of facial expressions with a variational auto-encoder (VAE). By simultaneously training with an adversarial (GAN) loss, we force the texture decoder to produce highly detailed textures that are almost indistinguishable from the original ones. The VAE now serves as a neural face model, which synthesizes consistent face geometry and texture from a low-dimensional expression vector. Instead of training one neural model for the whole face, we train multiple local models for different areas, such as the eyes and the mouth.


Architecture of the neural face model. The network consists of five parts: a convolutional texture
encoder/decoder (blue), a geometry decoder (green), a fully-connected bottleneck (yellow) that
combines texture and geometry information into a latent code vector (𝜇) and a deviation (𝜎). The texture discriminator network (orange) classifies textures as real or synthetic.
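The roles of 𝜇 and 𝜎 in the bottleneck can be illustrated with the standard VAE formulation. The sketch below shows the usual reparameterized sampling step and the closed-form KL divergence to a standard normal prior. It assumes a diagonal Gaussian posterior, which is the textbook choice and not confirmed here as the exact loss used for the face model; the function names are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    Writing the sample this way keeps z a deterministic function of mu and
    sigma, which is what makes the sampling step trainable by backpropagation.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian, summed over dimensions."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s)
        for m, s in zip(mu, sigma)
    )

# Usage: a 3-dimensional expression code (values are illustrative only).
rng = random.Random(0)
mu = [0.2, -0.1, 0.0]
sigma = [0.5, 1.0, 0.8]
z = reparameterize(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
```

At animation time one would typically feed 𝜇 (or an edited expression vector) directly to the decoders, so the sampling step matters mainly during training.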


Based on our neural face model, we develop novel animation approaches, for instance an example-based method for visual speech synthesis. Example-based animation approaches use short samples of captured motion (e.g. talking or changes of facial expression) to create new facial performances by concatenating or looping them. Our method uses short sequences of speech (i.e. dynamic visemes) from a database to synthesize visual speech for arbitrary words that have not been captured before. The neural face model offers several advantages that increase memory efficiency and improve the visual quality of the generated facial animations. Instead of working with the original texture and mesh sequences, we can store motion samples as sequences of latent expression vectors. This reduces memory requirements by a large margin and eases the concatenation of dynamic visemes, because linear interpolation in latent space yields realistic and artifact-free transitions.
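The concatenation step can be sketched as follows. Each dynamic viseme is assumed to be stored as a list of latent expression vectors, and two visemes are joined by inserting linearly interpolated frames between the last frame of the first and the first frame of the second. This is a minimal sketch, not the method's actual blending scheme: the function names and the `blend_frames` parameter are hypothetical.

```python
def lerp(a, b, t):
    """Linear interpolation between latent vectors a and b for t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def concatenate_visemes(first, second, blend_frames):
    """Join two latent sequences with `blend_frames` interpolated in-between frames.

    first, second: non-empty lists of latent vectors of equal dimension
    blend_frames:  number of transition frames to insert (0 means a hard cut)
    """
    if not first or not second:
        raise ValueError("both viseme sequences must contain at least one frame")
    bridge = [
        lerp(first[-1], second[0], k / (blend_frames + 1))
        for k in range(1, blend_frames + 1)
    ]
    return list(first) + bridge + list(second)

# Usage: two short 2-dimensional latent sequences (values are illustrative only).
viseme_a = [[0.0, 0.0], [1.0, 0.0]]
viseme_b = [[3.0, 0.0], [4.0, 0.0]]
sequence = concatenate_visemes(viseme_a, viseme_b, blend_frames=1)
```

Each frame of the resulting sequence would then be decoded by the neural face model into geometry and texture, so the transition frames never require blending meshes or images directly.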

This picture shows synthesized visual speech sequences for words that are not part of the viseme database.






Wolfgang Paier, Anna Hilsmann, Peter Eisert: Neural Face Models for Example-Based Visual Speech Synthesis, Proceedings of the 17th ACM SIGGRAPH European Conference on Visual Media Production (CVMP 2020), London, UK, Dec. 2020.

Wolfgang Paier, Anna Hilsmann, Peter Eisert: Interactive Facial Animation with Deep Neural Networks, IET Computer Vision, Special Issue on Computer Vision for the Creative Industries, 2020.