Possible Research Directions
Facial Landmark Localization
Geometric Deep Learning for Landmark Localization
The MoNet network as proposed in [
Monti et al., CVPR'17
] was applied for semantic segmentation of landmarks using the following steps:
For each RGB-D image, the ordered point cloud was 3D reconstructed, the outliers were removed and subsampling was applied to reduce the computational effort.
After localization and 3D reconstruction of the 2D landmarks, all points within a certain distance of a landmark were labeled with the index of that landmark and all others as background.
SHOT features were calculated for all points and MoNet was trained for classifying between the landmark index and background.
The method was able to localize the landmarks, but its accuracy and precision were low and it was not robust against facial expressions.
In general, such a geometric deep learning approach must be generalized to be applicable to whole point clouds instead of small subsets to perform tasks like authentication, basic emotion estimation and landmark regression (instead of segmentation).
Gilani et al., TPAMI'17
] generated depth maps of synthetic faces using a 3DMM. Afterward, they trained a CNN on RGB images of the normal vectors and the depth maps for binary segmentation between the landmark locations and background.
Szeptycki et al., BIOSIG'12
] segment the face by the sign of the mean and Gaussian curvatures and localize the segment with the nose tip afterward by classifying the curvatures.
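The labeling step of the MoNet experiment above (points near a reconstructed landmark receive its index, all others are background) can be sketched as follows. This is a minimal illustration, not the original implementation: the function name, the `radius` parameter and the convention that label 0 means background are assumptions.

```python
import numpy as np

def label_points_by_landmarks(points, landmarks, radius):
    """Label each point with the 1-based index of its nearest landmark
    if that landmark is within `radius`, otherwise with 0 (background)."""
    points = np.asarray(points, dtype=float)        # shape (N, 3)
    landmarks = np.asarray(landmarks, dtype=float)  # shape (K, 3)
    # Pairwise Euclidean distances between points and landmarks, shape (N, K)
    dists = np.linalg.norm(points[:, None, :] - landmarks[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    labels = nearest + 1                            # landmark indices start at 1
    too_far = dists[np.arange(len(points)), nearest] > radius
    labels[too_far] = 0                             # assumed background label
    return labels
```

The resulting labels could serve as per-point targets for a segmentation network; the choice of `radius` directly controls the class imbalance between landmark and background points.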
3D-from-2D Landmark Localization
Zhang et al., ICPR'18
] and [
Bulat and Tzimiropoulos, ICCV'17
] train CNNs on RGBD images with annotated 2D landmarks to predict the 3D landmark locations for RGB images at test time.
Application of the Approach of [
Simon et al., ECCV'14
] for Facial Landmark Localization
To apply this approach to faces, the VGG Face CNN from [
Parkhi et al., BMVC'15
] was used and the gradient of the last (fifth) pooling layer wrt the input image was analyzed.
Instead of manually examining the sensitivity of channels to different parts of the face, the location of the maximum sensitivity of a channel wrt the bounding box or the index of the most sensitive channel to an annotated landmark location can be learned.
The peaks of the resulting heat maps were located on the face in the test images. However, the heat maps were mostly spread over the whole face rather than sharply localized, and individual parts of the face were localized in only a few of the images.
Basic Emotion and Facial Action Unit Intensity Estimation
Gupta et al., IJCV'09
] derived the intensity of FAUs and emotions from angles and pair-wise distances between 2D landmarks.
Arriaga et al., arXiv'17
] applied an Xception CNN on the FER-2013 dataset and obtained 66 % accuracy in 22 ms per image on an i5 CPU. They provided their code on GitHub.
Estimation of Basic Emotions from 2D-to-3D-Landmarks
Since most of the existing approaches are not robust against variations of the facial pose, the following steps were implemented:
Detect the face and localize the 3D landmarks with 2D methods that are robust wrt pose variations e.g. [
Estimate a 3D affine transformation to normalize the alignment of the 3D landmarks to the x/y-plane and a unit distance between both eyes.
Train an SVM on the normalized landmark positions after applying both previous steps to a dataset like CK+.
The training dataset CK+ was too small to achieve a highly accurate and robust algorithm.
Furthermore, the localized 3D landmarks were inaccurate for greater pose deviations from the frontal view.
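The normalization step above (alignment to the x/y-plane and unit distance between both eyes) can be sketched as follows. This is only one possible realization under stated assumptions: it uses a similarity transformation (rotation, translation, uniform scale) as a restricted affine transformation, it assumes that the direction of smallest variance of the landmarks approximates the face normal, and the function name and the eye index parameters are hypothetical.

```python
import numpy as np

def normalize_landmarks(landmarks, left_eye_idx, right_eye_idx):
    """Rotate 3D landmarks into the x/y-plane, align the eye axis with +x
    and scale to unit inter-eye distance."""
    pts = np.asarray(landmarks, dtype=float)        # shape (N, 3), N >= 3
    centered = pts - pts.mean(axis=0)
    # Rows of vt are the principal axes; the last one has the smallest variance
    # and is assumed to approximate the face normal.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    if np.linalg.det(vt) < 0:                       # keep a proper rotation
        vt[2] *= -1.0
    aligned = centered @ vt.T                       # z is now the normal direction
    # In-plane rotation so that the eye axis points along +x.
    eye_vec = aligned[right_eye_idx] - aligned[left_eye_idx]
    angle = np.arctan2(eye_vec[1], eye_vec[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    aligned = aligned @ rot_z.T
    eye_dist = np.linalg.norm(aligned[right_eye_idx] - aligned[left_eye_idx])
    return aligned / eye_dist
```

The sign of the y and z axes remains ambiguous in this sketch (the face may end up upside down), so an additional reference landmark such as the nose tip or chin would be needed to fix the orientation before training a classifier on the result.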
Facial Properties from Point Cloud Features at Landmarks
Since point cloud features like the invariant moments from [
Trummer et al., ICCV'09
] capture the deformation of the face, they can also be used for estimating facial properties like the degree of palsy and the intensity of basic emotions. After extracting the invariant moments for each point in the cloud, the following set of features can be extracted for the subsequent classification task:
Statistics like their means, moments and histograms of the cluster centroids after applying K-means or an HDP.
A subset of invariant moments at the position of 2D landmarks which were reconstructed into the 3D point cloud using the known depth map.
Hussain et al., arXiv'17
] derived the intensity of FAUs and emotions from angles and pair-wise distances between 3D landmarks.
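The geometric features mentioned in this section (pair-wise distances and angles between landmarks) can be computed with a few lines of NumPy. The sketch below works for 2D and 3D landmarks alike; the function names are hypothetical and the choice of which landmark triples to use for angles is left open, since the cited papers are not reproduced here.

```python
import numpy as np

def pairwise_distance_features(landmarks):
    """Return the distances of all unordered landmark pairs (i < j)."""
    pts = np.asarray(landmarks, dtype=float)
    idx_i, idx_j = np.triu_indices(len(pts), k=1)
    return np.linalg.norm(pts[idx_i] - pts[idx_j], axis=1)

def angle_at(landmarks, a, b, c):
    """Return the angle in radians at landmark b spanned by landmarks a and c."""
    pts = np.asarray(landmarks, dtype=float)
    u, v = pts[a] - pts[b], pts[c] - pts[b]
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))   # clip guards against rounding
```

Both feature types are invariant to rotation and translation; distances additionally require a scale normalization such as the unit inter-eye distance used above.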
3D Face Classification
(recognition of faces as identification, authentication or clustering)
Gilani and Mian, CVPR'18
] captured 3D face scans of 1785 IDs, generated 300 synthetic face scans using a 3DMM and calculated mean faces between them after an alignment step. Afterward, they generated the depth maps for all face scans from 15 viewpoints each and represented the depth map together with the azimuth and elevation angles of the normal vectors as RGB images. Finally, they trained a CNN to classify between the faces and achieved almost perfect testing accuracies on all publicly available datasets.
Authentication using Point Cloud Features at Landmarks
Localize landmarks in 2D images and 3D reconstruct them using the known depth.
Calculate local point cloud features like SHOT descriptors [
Tombari et al., ECCV'10
] for each 3D landmark position and concatenate them.
Authenticate individuals using an SVM or a similarity metric between features like the Euclidean or cosine distance.
The approach with the concatenated feature vector achieved an F1-score of 92 % on 105 subjects for the expression-vs-neutral setup where the neutral faces were used for training.
Another approach was to authenticate persons based on the interquartile ranges of the feature dimensions. However, the decision on which dimensions to rely on for each subject turned out to be complicated.
Align the 3D face scan from the enrollment with the given 2D image of the face in the authentication stage and compare them in 2D or 3D (after reconstruction) in order to detect masks or heavy makeup.
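The similarity-based variant of the authentication steps above can be sketched as follows. The local descriptors (e.g. SHOT) are assumed to be computed elsewhere and passed in as one vector per landmark; the function names and the `threshold` parameter are hypothetical, and the threshold would have to be tuned on validation data.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two feature vectors (0 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def authenticate(probe_descriptors, enrolled_descriptors, threshold):
    """Concatenate per-landmark descriptors and accept the probe if its
    cosine distance to the enrolled template does not exceed `threshold`."""
    probe = np.concatenate([np.ravel(d) for d in probe_descriptors])
    enrolled = np.concatenate([np.ravel(d) for d in enrolled_descriptors])
    return bool(cosine_distance(probe, enrolled) <= threshold)
```

The concatenation requires that the landmarks are given in the same order for enrollment and authentication, otherwise descriptor blocks of different facial regions would be compared with each other.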
Presentation Attack Detection
(liveness and mask detection)
Heart Rate Estimation
Detect a presentation attack by the following methods:
De Haan, Jeanne, TBME'13
] estimate the heart rate from the difference between two orthogonal chrominance signals derived from the skin color in a video of a face.
Liu et al., ECCV'16
] also estimate the heart rate from chrominance signals but track the changing color of local image patches between detected landmarks over time and perform correlation-based statistics between these patches.
Liu et al., CVPR'18
] estimate a depth map of the face from single images using a multi-task CNN, which fits a 3DMM to the face. After registration of the 3DMM and mapping of the texture, they use an LSTM for estimating the heart rate. During training, they used the estimated heart rate from chrominance signals.
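The idea of estimating the heart rate from the difference of two chrominance signals can be illustrated with the simplified sketch below. It is not a faithful reimplementation of any cited paper: the channel coefficients are the ones commonly quoted for chrominance-based methods and should be checked against the original publication, the input is assumed to be the mean skin color per frame, and the frequency band is an assumption. As a simplification, the band limitation is applied only when selecting the spectral peak.

```python
import numpy as np

def estimate_heart_rate_bpm(rgb, fps, band_hz=(0.7, 4.0)):
    """Estimate the heart rate from a (T, 3) array of mean skin RGB values."""
    rgb = np.asarray(rgb, dtype=float)
    normalized = rgb / rgb.mean(axis=0)      # remove the static skin tone
    r, g, b = normalized.T
    x = 3.0 * r - 2.0 * g                    # assumed chrominance coefficients
    y = 1.5 * r + g - 1.5 * b
    x, y = x - x.mean(), y - y.mean()
    alpha = x.std() / y.std()                # balances both chrominance signals
    pulse = x - alpha * y                    # intensity changes largely cancel
    spectrum = np.abs(np.fft.rfft(pulse))
    freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fps)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    peak_hz = freqs[in_band][np.argmax(spectrum[in_band])]
    return 60.0 * peak_hz
```

For presentation attack detection, the absence of a clear spectral peak in the plausible heart rate band could be used as an indication of a photograph or mask, although this sketch does not include such a decision rule.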
Offline vs. Online Calibration
Estimate the camera calibration from the detected facial landmarks in N images
A presentation attack is detected if the deviation between the estimated camera parameters for enrollment and authentication exceeds a threshold
Tang et al., NDSS'18
] emit a flashing light with a random color onto faces and learn the difference of the reflectance characteristics and the response times to distinguish between photographs or videos of faces and genuine faces.
Lagorio et al., IWBF'13
] estimate the depth map from stereo images and compare the mean curvature between bent photographs and genuine faces.
Wang et al., ICB'13
] 3D reconstruct the facial landmarks from a video and classify the concatenated coordinates between genuine faces and bent photographs.
3D Face Segmentation
The invariant moments, proposed in [
Trummer et al., ICCV'09
], were calculated for point clouds of faces. After a whitening of these features, the K-means clustering approach was applied to segment the face into facial parts with a similar shape.
The resulting segmentations were plausible and symmetric for both facial halves.
The invariant moments are not invariant wrt facial expressions. Therefore, this approach can only be applied for behavior analysis like tracking the cluster centroids over time and cannot be applied for authentication of individuals.
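The segmentation pipeline above (whitening of the per-point features followed by K-means) can be sketched in plain NumPy. The invariant moments themselves are assumed to be given as an (N, D) feature array; the PCA-based whitening, the farthest-point initialization and the function names are choices made for this illustration and not necessarily those of the original experiment.

```python
import numpy as np

def whiten(features, eps=1e-8):
    """PCA-whiten features so that their covariance becomes the identity."""
    x = np.asarray(features, dtype=float)
    x = x - x.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
    return (x @ eigvecs) / np.sqrt(np.maximum(eigvals, 0.0) + eps)

def kmeans(x, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    x = np.asarray(x, dtype=float)
    centroids = [x[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centroids], axis=0)
        centroids.append(x[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Usage: labels, centroids = kmeans(whiten(moment_features), k=8)
```

The cluster label of each point then defines its facial segment, and tracking `centroids` over time corresponds to the behavior analysis use case mentioned above.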
3D Face Reconstruction
Tewari et al., CVPR'18
] trained a 3DMM and two additive correction models (MLPs) using end-to-end learning and a self-supervision loss. The MLPs learn the deviation between the reflectance/geometry models (shape and expression) of the 3DMM and the actual appearance. In the loss function, the testing image is compared with the estimated 3D face model by applying an affine transformation and a pinhole camera model whose parameters are also learned, since they are differentiable.
Sela et al., ICCV'17
] predicted the fine-grained facial 3D structure from a static 2D depth image using a 3DMM for rough registration, a VAE (U-Net) and
along the normals for the additional fine-grained details. The loss function of their U-Net incorporates the L1-losses of the normals and the depths.
In general, the parameters of a fitted 3DMM can be used for authentication, spoofing detection and behavioral analysis in 3D-Finder. However, the learned representations of most 3DMMs are not accurate enough for these tasks, except for the two approaches mentioned above.
Geometric Deep Learning
(3D semantic segmentation and vertex classification)
Deep Learning for point sets
Su et al., ICCV'15
] applied a CNN to each depth map of a 3D object synthesized from virtual viewpoints, applied a pooling layer to the intermediate representations and another CNN for the final classification of the 3D object.
Qi et al., arXiv'16
] applied shared MLPs to the n×3 array of 3D points to generalize convolutions to point clouds, used a symmetric max-pooling function to make the learned model invariant to input permutations, and learned transformation matrices to align the input points and features.
Vinyals et al., ICLR'16
] applied an RNN to point sequences or unordered lists of points.
Wu et al., CVPR'15
] applied several 3D convolutions to a voxel representation of a 3D object in order to classify it.
These methods do not capture the intrinsic structure (shape and neighborhood) and are therefore not invariant to deformations.
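How a network operating on an n×3 point array can become invariant to the order of the points, as in the MLP-based approach listed above, is illustrated by the following sketch: the same small MLP is applied to every point, and a symmetric max pooling over the points produces the global feature. The layer sizes and the function name are arbitrary assumptions, and the weights are passed in explicitly instead of being trained.

```python
import numpy as np

def global_point_feature(points, w1, b1, w2, b2):
    """Shared two-layer MLP per point followed by max pooling over all points."""
    points = np.asarray(points, dtype=float)        # shape (n, 3)
    hidden = np.maximum(points @ w1 + b1, 0.0)      # same weights for every point
    per_point = np.maximum(hidden @ w2 + b2, 0.0)   # shape (n, feature_dim)
    return per_point.max(axis=0)                    # order-independent pooling

# Usage with random (untrained) weights:
rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))
w1, b1 = rng.normal(size=(3, 16)), rng.normal(size=16)
w2, b2 = rng.normal(size=(16, 32)), rng.normal(size=32)
feature = global_point_feature(cloud, w1, b1, w2, b2)
shuffled = global_point_feature(cloud[rng.permutation(100)], w1, b1, w2, b2)
```

Here `feature` and `shuffled` agree because the maximum over the points does not depend on their order. The sketch also makes the limitation noted above visible: each point is processed independently, so neighborhood structure is not captured.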
Permutohedral Lattice CNN [
Su et al., CVPR'18
] generalizes convolutions to sparse and unordered input data points with a varying number by filtering in a higher dimension like (x,y,z,r,g,b). They implement convolution as point-wise multiplication with the permutohedral lattice grid by projecting all points onto its vertices and afterward back to their original position. This approach originates from
of color images.
(invariant to surface deformations)
Graph Convolutions in Spatial Domain
Monti et al., CVPR'17
] performed convolution in a local polar coordinate system centered at each anchor point using learned GMMs.
Verma et al., CVPR'18
] rewrote the convolution operation as matrix multiplications and applied it to the nearest neighbors in a graph or mesh which were weighted by their distance.
Shenlong Wang et al., CVPR'18
] approximated continuous convolution by Monte-Carlo sampling.
Yue Wang et al., arXiv'18
] performed convolution by multiplying the edge weights of the graph with learned weights and aggregated the products to a new pointwise feature.
All methods except FeaStNet can only be applied to local portions of the point cloud for pointwise classification tasks. The learned filters are not invariant against translations since the receptive field changes when the underlying surface is deformed. A solution for that would be to apply several instances of MoNet to a fixed set of 3D landmarks and concatenate or combine their intermediate results to solve global point cloud classification or regression tasks.
Change the loss function of FeaStNet to classify individual shapes of faces.
Graph Convolutions in Spectral Domain
Bruna et al., arXiv'14
] applied point-wise multiplication in the frequency domain instead of convolution in the spatial domain in order to generalize CNNs to non-Euclidean domains like graphs and manifolds. The spectrum of the graph weights is given by the eigenvectors of the graph Laplacian.
Kostrikov et al., CVPR'18
] applied the Dirac operator instead of the Laplace operator to partial meshes in the frequency domain. Since the spectrum of the Dirac operator detects principal curvature directions, the learned filters are invariant to isometric deformations, similar to Geometric Deep Learning methods from the spatial domain.
These methods do not allow for a varying number of graph nodes.
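The principle of a spectral graph convolution (point-wise multiplication in the basis of the Laplacian eigenvectors) can be sketched as follows. The sketch uses the unnormalized combinatorial Laplacian L = D - A and one weight per eigenvector; both are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized graph Laplacian L = D - A of an undirected graph."""
    a = np.asarray(adjacency, dtype=float)
    return np.diag(a.sum(axis=1)) - a

def spectral_filter(signal, adjacency, spectral_weights):
    """Filter a node signal by point-wise multiplication in the spectral domain."""
    _, eigvecs = np.linalg.eigh(graph_laplacian(adjacency))  # ascending eigenvalues
    coeffs = eigvecs.T @ np.asarray(signal, dtype=float)     # graph Fourier transform
    return eigvecs @ (np.asarray(spectral_weights) * coeffs) # filter and transform back

# Usage on a path graph with four nodes:
adjacency = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
signal = np.array([1.0, 2.0, 3.0, 6.0])
smoothed = spectral_filter(signal, adjacency, np.array([1.0, 0.0, 0.0, 0.0]))
```

Keeping only the first (zero-frequency) component replaces every node value by the mean of the signal. Because `spectral_weights` has exactly one entry per eigenvector, the filter is tied to the number of nodes and to the eigenbasis of one specific graph, which illustrates the limitation stated above.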
Novelty Detection Ideas
Principal Component Separation of 3DMMs
A 3DMM like [
Gerig et al., FG'18
] can be used for sampling point clouds of faces with facial expressions by changing the weights of certain principal components. After dividing the principal components into a training set and a testing set, a novelty detection method can be trained on one subset of resulting point clouds and should classify the other subset as abnormal.
It is unclear which features should be extracted from the point clouds as the input of the novelty detection method. Furthermore, the aim of 3D-Finder is detecting novel behavior and thus, sets of point clouds for each principal component must be generated.
Limited to the expressiveness of the 3DMM and thus not able to account for micro-expressions.
Novel Face Appearance and Position
Novelty detection methods can also be applied for distinguishing between normal appearances and positions of faces. E.g. the head should be on top of the shoulders, the 3D shape of the head should differ greatly from a flat or bent photograph, the skin texture should contain pores, folds, and moles.
Probabilistic Methods and GANs
Probabilistic Face Modelling and Inference
Kulkarni et al., CVPR'15
] 3D reconstruct faces from single images using probabilistic programming. They compare a modeled hypothesis image I_R and an observed image I_D of a face using a simple distance measurement based on their feature representations. I_R is synthesized using a 3DMM, a pinhole camera model and a light source whose parameters are randomly drawn plus a structured noise process. Finally, the inference procedure iteratively generates images I_R which become more similar to I_D by maximizing the probability of the state.
Coelho de Castro and Nowozin, ECCV'18
] identify detected faces based on probability distributions over their context (e.g. work, home, gym), identity, feature representations, label, and name. This Bayesian framework, therefore, allows for an unbounded number of identified individuals, situational context-priors, misspellings of their names, wrong labels, and deviations of their feature representations. Furthermore, the context model allows for more confident predictions about people who tend to occur together in the same environment.
Facial Palsy Prediction
10800 images of patients were aligned, mirrored and fed into a GAN for training.
Afterward, the resulting latent space allowed for drawing realistic faces from all therapeutic exercises, linearly interpolating between them, and changing facial attributes.
To further enhance the results, one should train a conditional GAN or VAE given the therapeutic exercise.
For facial palsy prediction, the degree of palsy must be known for all training samples; the degree is then estimated for a testing sample by obtaining its nearest neighbors from the training set in the latent space, as explained in [
Schlegl et al., IPMI'17
Furthermore, it would also be possible to train the GAN on healthy faces and take the distance between the resulting manifold in latent space and a testing image of a patient as a measurement for the degree of palsy.
Another approach is to train a GAN on images of healthy subjects and another GAN on images of patients. If the landmark locations were given in case of the healthy subjects, transfer learning can be applied between both manifolds in latent space to predict the landmark locations for the patients.
An algorithm for dimensionality reduction like t-SNE can be applied to the representations in latent space to find clusters that belong to certain paralysis degrees, performed exercises, or facial properties of the patients. Alternatively, an SVM can be used to directly classify the degree of paralysis in latent space.
Pumarola et al., ECCV'18
] train a GAN on images of facial expressions, where the magnitude of each AU can be given as a conditional variable. Therefore, the facial expression can also be synthesized afterward by combining several AUs.
Raymond et al., arXiv'16
] trained a GAN on images of faces. After introducing new residual and discrimination losses, they were able to obtain the closest training image in latent space to a given testing image. They randomly sampled an initial latent variable, calculated both new losses and performed backpropagation based on the new losses to iteratively obtain a latent variable that is the closest representation of the testing image. They used this approach for inpainting.
Schlegl et al., IPMI'17
] improved the method of the point above and used the remaining loss after convergence as a measurement of novelty to detect the degree of a disease in images of the retina. Additionally, they used the remaining residual image for detecting anomalous regions in the testing image.
Similar to the last two points above, a pretrained GAN on faces could be used for predicting the degree of facial palsy of a patient. To apply this approach to obtain a novelty score for detecting spoofing attacks in 3D-Finder, it must be generalized to RGBD-images and/or time series.
The degree of facial palsy can also be predicted by minimizing the distance to a feature vector of all training images calculated by a pre-trained CNN for face identification like [
Parkhi et al., BMVC'15
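The latent-space search described above for the GAN-based inpainting and novelty detection approaches (iteratively updating a randomly initialized latent variable until the generated image is closest to the testing image, then using the remaining loss as novelty score) can be illustrated with a toy example. The sketch replaces the trained generator by a linear map `generator_matrix`, which is purely a stand-in assumption, and it uses only the residual loss without a discrimination loss. With a real GAN, the gradient would be obtained by backpropagation through the generator.

```python
import numpy as np

def fit_latent(generator_matrix, image, n_steps=200, seed=0):
    """Gradient descent on z to minimize ||G z - image||^2 for a linear
    stand-in generator G; returns z, the residual loss and the residual image."""
    g = np.asarray(generator_matrix, dtype=float)   # shape (pixels, latent_dim)
    image = np.asarray(image, dtype=float)
    rng = np.random.default_rng(seed)
    z = rng.normal(size=g.shape[1])                 # random initial latent variable
    step = 0.5 / np.linalg.norm(g, 2) ** 2          # step size that keeps descent stable
    for _ in range(n_steps):
        residual = g @ z - image
        z -= step * 2.0 * g.T @ residual            # gradient of the residual loss
    residual_image = image - g @ z
    return z, float(np.sum(residual_image ** 2)), residual_image
```

Images that lie on the generator's manifold reach a residual loss near zero, while images with content the generator cannot produce keep a positive loss; the residual image shows where the deviation is located. For facial palsy, the loss of a patient image under a generator trained on healthy faces could serve as the score, subject to the generalization issues noted above.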
Facial Behavior or Time Series Analysis
3D Landmark Tracking
Localize and 3D reconstruct detected 2D landmarks using the known depth.
Estimate a 3D affine transformation to normalize the alignment of the 3D landmarks to the x/y-plane and a unit distance between both eyes. Alternatively, the landmarks can be aligned with the initial position of the 3D landmarks.
Track the position or deviation of the landmarks over time. Additionally, the pairwise distances and angles between the landmarks of both facial halves e.g. the left and the right mouth corner can also be considered.
Estimate the similarity between the time series using DTW [
Schokoohi-Yekta et al., SIAM'15
] or a global feature extraction method (see [
Aghabozorgi et al., IS'15
The time series are usually very noisy and not aligned, even if the subjects perform the same exercises. Furthermore, there is a great variance in how the same exercise is performed by different subjects, i.e. different facial muscles are activated.
Using these time series or intervals of these, the task of detecting abnormal behavior can also be solved using a novelty detection method. Similarly, the degree of facial paralysis can be inferred from the novelty score of a time series which corresponds to a patient with respect to a time series which corresponds to a healthy person performing the same facial expression. Moreover, the healthy and sick facial halves of the patient can be compared instead of faces of different persons against each other.
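The DTW-based similarity between two landmark time series mentioned in the steps above can be sketched with the classic dynamic programming formulation. The sketch uses the Euclidean distance between frames as local cost and no warping window; both are assumptions, and the function name is hypothetical.

```python
import numpy as np

def dtw_distance(series_a, series_b):
    """Dynamic Time Warping distance between two series of shape (T,) or (T, D)."""
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(series_b, dtype=float)
    a = a.reshape(len(a), -1)                # one row per frame
    b = b.reshape(len(b), -1)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)   # accumulated cost table
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],      # a advances
                                     cost[i, j - 1],      # b advances
                                     cost[i - 1, j - 1])  # both advance
    return cost[n, m]
```

For a landmark series, each frame would be the flattened vector of normalized landmark coordinates. Because DTW compensates for differences in speed but not in amplitude, the same exercise performed slower yields a small distance, whereas a reduced range of motion on one facial half remains visible in the score.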
Affine Invariant Facial Shape Tracking, Analysis and Classification
Kacem et al., FG'18
] transformed detected 2D landmarks to
to achieve affine invariance. Afterward, PCA is used to reduce the dimensionality of the time series of landmarks, followed by Fisher Vector encoding, and an SVM is used to classify the time series into three
levels of patients.
Kacem et al., FG'18
] classified the
using the barycentric coordinates of a time series of landmarks. They used the Mahalanobis distance between landmarks as a similarity measurement over time and applied Dynamic Time Warping (DTW) to align the time series and obtain a final pair-wise similarity score. Afterward, they applied an SVM to the vector of similarities between the time series and all other samples in the dataset for classifying into the basic emotions.
Kacem et al., TPAMI'18
] derived the matrix of pairwise distances between all static landmarks from the Gram matrix as an affine invariant representation. The distances between these matrices were calculated in their underlying Riemannian and Grassmann manifolds (see [
]). However, I was not able to implement this distance and they did not provide their code. Similarly to the point above, they used DTW and an SVM to classify into
Desrosiers et al., IVC'17
] collected RGBD-videos of 16 patients with
. For each video and each RGBD-image, they performed preprocessing, detected the nose tip, aligned the face with the horizontally mirrored shape and extracted radial curves along the shape for each of 100 angular directions centered at the nose tip. After an elastic alignment of each radial curve with the mirrored shape, they used the magnitude of the deformation at each point of the curve as a measurement of the asymmetry between both facial halves, which they call Dense Scalar Field (DSF).
Zhen et al., TAC'17
] obtained the deformation of a face during a facial expression using the same DSFs on RGBD-videos between the initial RGBD-image and each consecutive recording after preprocessing each of the facial 3D shapes. After semi-supervised classification of the onset, apex and offset frames using the
and magnification of the DSFs, they vectorized the three DSFs and classified them using an SVM into the
For my purpose of detecting abnormal facial behavior, I also need a similarity measurement between two time series of landmarks (or whole shapes), which is invariant wrt affine transformations.
For classifying the degree of facial palsy, the approach of DSFs can also be used. Also, novelty detection based on the DSF can be used to measure the deviation between the face of a healthy person and a patient.