Optimized Structure for Facial Action Unit Relationship Using Bayesian Network

Facial expression recognition is an important task in human-computer interaction, and computer vision techniques have been widely employed to automate it. The Facial Action Coding System (FACS) provides the most comprehensive description of facial expression; it includes 46 action units (AUs) that involve facial muscle movements. In this paper, the relationships between AUs are modeled using a Bayesian Network (BN) structure. AU relationships and their modeling are explained, and the learning algorithms for the BN structure are discussed. The aim of this work is to obtain the optimized structure that best explains the relationships among AUs based on two AU-coded databases. These relationships are important for subsequent facial expression recognition. Experiments demonstrate that a prior network is essential to start and ease the search process. Because the space of possible structures is large, constraints are applied to simplify the search. BN parameters are also learned from the two databases for the different structures, and BN inference and classification are performed on the data set. The average recognition rate for each AU is reported, and the overall recognition performance is analyzed to select the optimized structure that best describes the AU relationships.


Introduction
Facial expression recognition has been a very active research topic in recent years. It is a challenging task with many difficulties and limitations. A face carries a person's identity, is unique to each individual, and changes with age. Computer vision techniques have been widely used to recognize facial expressions from images and video, but variations in pose and illumination, together with the complicated combinations of facial expressions, make the recognition task very challenging. Ekman and Keltner developed the Facial Action Coding System (FACS) [1], which describes facial expressions in terms of Action Units (AUs). FACS consists of 46 AUs, which are primarily related to facial muscle movements. AUs have been used widely in recent research, especially in facial expression recognition. However, FACS was developed for coding by hand by human experts, who must go through a learning and training process to become proficient. Trained experts currently perform the coding by making perceptual judgments of video sequences, often frame by frame. The limitation is that approximately 100 hours are required to train a person to make these judgments reliably and pass a standardized reliability test [2].
Computer vision techniques have been used to automate AU recognition. However, because facial expressions are so rich, it is important to learn the relationships between AUs. A facial expression rarely involves only one AU, and the AUs are related to each other because they involve the muscles of the human face. In this paper, the relationships among the 14 main AUs involved in the 6 basic expressions (anger, disgust, fear, happiness, sadness, surprise), as described by FACS, are studied and modeled using a Bayesian Network (BN).

Related Works
Recent surveys on facial expression recognition are given in [3] and [4]. Approaches to facial expression analysis fall into two types: appearance-based and geometric-based. An overview of various approaches, such as the optical flow method, local binary patterns, the pyramid of histogram of gradients (PHOG), local phase quantisation (LPQ), and the Facial Action Coding System (FACS), is given in [3]. FACS is the most objective and comprehensive coding system in the behavioural sciences. Challenges for facial expression recognition in real-life applications, such as pose, illumination, and occlusion, are discussed in [4]. That survey also gives an overview of the methodology to be followed for facial expression recognition.
Pitas et al. [5] proposed using shape and texture information for facial expression recognition. The method fuses the texture and shape information extracted from a video sequence, using a subspace representation method and a Euclidean embedding combined with an SVM system, respectively. Mahoor et al. [6] presented a sparse learning approach for AU combination classification. Gabor features were extracted at the locations of facial landmark points, obtained using AAM, to represent facial images. A dictionary was then developed to recognize combinations of facial AUs using L1-norm minimization.
A system that recognizes facial action units by exploiting their semantic and dynamic relationships using a Dynamic Bayesian Network (DBN) is proposed in [7]. Probabilistic relationships among various AUs are represented, and temporal changes in facial action units are accounted for. Computer vision techniques are employed to obtain AU measurements, which are then applied to the DBN to infer the various AUs. However, the temporal links between AUs are limited, as temporal links are not provided for all target AUs.

Action Unit Relationship
The relationships between AUs are learned from the AU-coded databases by considering their co-absence and co-occurrence in an image. The probabilistic relationships among AUs are computed, and an initial structure that represents these relationships is learned as a BN. The best structure is then found by taking the structure whose score best fits the database. In this section, the databases used in this work are presented. Then, the AU relationship modeling and the learning algorithm are discussed.

Database
Two databases are used to learn the relationships of AUs using a Bayesian Network. The first database is the Cohn-Kanade DFAT-504 database [8], which consists of image sequences of more than 100 subjects. Facial behavior was recorded from subjects covering different races, ages, and genders. Each sequence begins with a neutral expression and proceeds to a peak expression. The peak expression of each sequence is fully FACS coded and given an emotion label. The first image of each sequence, which reflects the neutral state of the subject, and the last image, which shows the peak expression of the subject, are used to train the system.
The second database is the RPI ISL Facial Expression database (RIFE) [7]. This database consists of 42 image sequences from 10 subjects displaying facial expressions that undergo a neutral-apex-neutral evolution. In this work, 38 image sequences from 6 subjects are used for training, while the remaining images are reserved for testing in our future work. The ISL database is coded with AU labels frame by frame.

AU Relationship Modeling
A BN is a directed acyclic graph (DAG) that represents a joint probability distribution among a set of variables. A BN structure represents knowledge about uncertain variables and provides a computational architecture for computing the impact of evidence on beliefs. Variables are depicted as nodes, and arcs represent probabilistic dependence between variables. Computing posterior probabilities given evidence about selected nodes exploits probabilistic independence for efficient computation.
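For reference, the joint distribution encoded by such a DAG factorizes over the nodes given their parents. The notation below ($X_i$ for a node, $\mathrm{Pa}(X_i)$ for its parent set, $n$ for the number of nodes) is introduced here for illustration and does not appear in the paper.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each factor corresponds to one conditional probability table, so a node without parents contributes only its marginal distribution.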
In this case, our variables are the AUs, which are denoted as nodes in the BN structure. The directed arcs denote the conditional dependencies among AUs, each of which is characterized by a conditional probability table (CPT). An initial BN structure is derived by analyzing the AU relationships in the AU-coded images from the two facial expression databases described earlier, similar to the work done in [7].
The relationships among AUs are learned from their co-occurrence and co-absence. Each entry in Table 1(a) represents the conditional probability $P(\mathrm{AU}_i = 1 \mid \mathrm{AU}_j = 1)$, whereas each entry in Table 1(b) represents the conditional probability $P(\mathrm{AU}_i = 0 \mid \mathrm{AU}_j = 0)$. Each probability is obtained by counting the AUs involved in the facial images from both databases [7]. After both tables are constructed, an initial BN structure is obtained.
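The counting process can be sketched as follows. This is a minimal illustration, assuming the AU labels are available as one row of 0/1 indicators per image; the function name `cooccurrence_tables` and the toy label matrix are hypothetical and not taken from either database.

```python
def cooccurrence_tables(labels):
    """Estimate pairwise conditional probabilities by counting.

    labels: list of rows, one per image, each a list of 0/1 AU indicators.
    Returns (p_present, p_absent) with
      p_present[i][j] = P(AU_i = 1 | AU_j = 1)
      p_absent[i][j]  = P(AU_i = 0 | AU_j = 0)
    An entry is None when no image supports the conditioning event.
    """
    n = len(labels[0])

    def table(value):
        out = [[None] * n for _ in range(n)]
        for j in range(n):
            # Images in which AU_j takes the conditioning value.
            rows_j = [r for r in labels if r[j] == value]
            if not rows_j:
                continue
            for i in range(n):
                hits = sum(1 for r in rows_j if r[i] == value)
                out[i][j] = hits / len(rows_j)
        return out

    return table(1), table(0)


# Toy labels for three hypothetical AUs (columns) over four images (rows).
toy = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
p_present, p_absent = cooccurrence_tables(toy)
print(p_present[0][1], p_present[1][0], p_absent[0][1])
```

Pairs with a high co-occurrence or co-absence probability are natural candidates for arcs in the initial structure; how such a cutoff is chosen is not specified here.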

Structure Learning Algorithm
As mentioned earlier, our aim is to find the structure that best fits the database in order to model the relationships among AUs. After analyzing the AU relationships in Table 1, an initial structure is obtained. We then learn the structure with a search algorithm to confirm it. There are two types of structure learning methods: constraint-based and score-based [9]. Constraint-based learning finds a Bayesian network structure whose implied independence constraints "match" those found in the data. Scoring methods, such as the Bayesian, Minimum Description Length (MDL), and Minimum Message Length (MML) scores, find the Bayesian network structure that can represent distributions that "match" the data. We use the score-based method, which is the most widely implemented. The structure learning algorithm computes the score of a structure and selects the best structure based on the training data. The score we use is the Bayesian score. The score-based method proceeds as follows: a) define a quality metric (the score) to maximize, b) use greedy search to determine the next best arc to add, and c) stop when adding an arc no longer increases the metric. The Bayesian score used in our model selection is defined in Eq. (1),

$$\mathrm{Score}(B) = \log p(D \mid B) + \log p(B) \qquad (1)$$

where B denotes the network structure and D denotes a database of sampled data. The first term is the log likelihood and the second term is the log prior probability of the structure B. The log likelihood measures how well the structure fits the database, and a penalty term on the number of arcs is commonly added to it. For a large database, the Bayesian information criterion (BIC) [10] is used to approximate the log likelihood.
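To make the scoring step concrete, the following is a minimal sketch of a BIC score for binary AU data. It is an illustration under stated assumptions, not the paper's implementation: the structure prior log p(B) is omitted (equivalent to a uniform prior), every node is binary, and the names `bic_score`, `with_arc` and `empty` are hypothetical.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """BIC score of a structure on complete binary data.

    data:    list of tuples of 0/1 values, one tuple per image.
    parents: dict mapping a node index to a tuple of parent indices.
    Score = maximized log likelihood - (d / 2) * log(N), where d is the
    number of free parameters and N the number of samples.
    """
    n_samples = len(data)
    log_lik = 0.0
    n_params = 0
    for node, pa in parents.items():
        # Counts of (parent configuration, node value) and of the configuration alone.
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        config = Counter(tuple(row[p] for p in pa) for row in data)
        for (cfg, _value), count in joint.items():
            log_lik += count * math.log(count / config[cfg])
        # A binary node has one free parameter per parent configuration.
        n_params += 2 ** len(pa)
    return log_lik - 0.5 * n_params * math.log(n_samples)


# Toy data: two hypothetical AUs that usually appear or disappear together.
data = [(1, 1)] * 40 + [(0, 0)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10
empty = {0: (), 1: ()}
with_arc = {0: (), 1: (0,)}
print(bic_score(data, empty), bic_score(data, with_arc))
```

On this toy data the structure with the arc scores higher because the gain in likelihood outweighs the penalty for the extra parameter; on data where the two columns are independent the penalty makes the empty structure preferable.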
Next, the network structure with the highest score needs to be identified by a search algorithm. The greedy hill-climbing search algorithm [11] is used in our work. The algorithm starts with a given network and, at each iteration, evaluates all possible changes. It then moves to the neighbor that has the highest score and repeats; if no neighbor has a higher score than the current structure, the algorithm stops. Pseudo-code for the algorithm is shown in Fig. 1.
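The search loop described above can be sketched in Python as follows. This is a hedged illustration rather than a transcription of Fig. 1: the neighborhood (add, delete, or reverse one arc), the BIC-style score, the optional `max_parents` cap, and all function names are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import permutations


def bic(data, parents):
    """BIC score on binary data (an assumed stand-in for the paper's score)."""
    total = 0.0
    for node, pa in parents.items():
        pa = sorted(pa)
        joint = Counter((tuple(r[p] for p in pa), r[node]) for r in data)
        cfg = Counter(tuple(r[p] for p in pa) for r in data)
        total += sum(c * math.log(c / cfg[k]) for (k, _), c in joint.items())
        total -= 0.5 * (2 ** len(pa)) * math.log(len(data))
    return total


def is_acyclic(parents):
    """Depth-first search along parent links; False if a directed cycle exists."""
    state = {}

    def visit(v):
        if state.get(v) == 1:      # v is already on the current path: cycle
            return False
        if state.get(v) == 2:      # v was fully explored earlier
            return True
        state[v] = 1
        ok = all(visit(p) for p in parents[v])
        state[v] = 2
        return ok

    return all(visit(v) for v in parents)


def neighbors(parents, max_parents=None):
    """Yield every valid structure that differs from `parents` by one arc change."""
    def allowed(g):
        within_cap = max_parents is None or all(len(p) <= max_parents for p in g.values())
        return within_cap and is_acyclic(g)

    for a, b in permutations(parents, 2):          # candidate arc a -> b
        g = {k: set(v) for k, v in parents.items()}
        if a in g[b]:
            g[b].remove(a)                         # delete a -> b
            yield g
            r = {k: set(v) for k, v in g.items()}
            r[a].add(b)                            # reverse to b -> a
            if allowed(r):
                yield r
        else:
            g[b].add(a)                            # add a -> b
            if allowed(g):
                yield g


def hill_climb(data, start, score=bic, max_parents=None):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current = {k: set(v) for k, v in start.items()}
    best = score(data, current)
    while True:
        scored = [(score(data, g), g) for g in neighbors(current, max_parents)]
        if not scored:
            return current, best
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score <= best + 1e-9:               # no real improvement: stop
            return current, best
        current, best = top, top_score


# Toy data: nodes 0 and 1 are dependent, node 2 is independent of both.
data = []
for (a, b), n in {(1, 1): 40, (0, 0): 40, (1, 0): 10, (0, 1): 10}.items():
    data += [(a, b, k % 2) for k in range(n)]
empty = {0: set(), 1: set(), 2: set()}
learned, final_score = hill_climb(data, empty)
print(learned, final_score)
```

The example starts from an empty graph for brevity; passing a prior structure as `start` corresponds to starting the search from a given network, and `max_parents` limits the number of parents per node.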

Learned Structures
Our focus in this paper is to learn a BN structure that best models the relationships among AUs. An initial BN structure, shown in Fig. 2, is obtained after analyzing the relationships of AUs. In addition, we also learn from a randomly constructed structure. Table 2 shows the scores of both learned structures using the greedy hill-climbing algorithm. The final score from the predefined prior structure is slightly higher than that from the randomly constructed structure, and the execution time for the prior structure is also shorter. The scores of the structures after each iteration are shown in Fig. 3. The starting score for the prior structure is much higher than that of the randomly constructed structure. Hence, the manually constructed structure is a very good starting point for finding an appropriate structure that fits the training data well.
For model simplicity, we also constrain the search space by limiting the number of parents for each node. First, each node is allowed at most two parents; only the best two parents are chosen based on the score of the learned structure. We then tested further cases in which the maximum number of parents per node is raised, up to 6 parents. The scores of the structures and the execution times are shown in Table 3. We can conclude that as the allowed number of parent nodes increases, the learned structures give higher scores. However, the execution time and the complexity of the structure also increase. Comparing the structure with at most 2 parents and the structure without constraints, the difference in score is small (6.6%), whereas the difference in execution time is 147%. See Fig. 4 for the learned structures with constraints.
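The effect of the parent limit on complexity can be illustrated with a short calculation. The sketch below assumes 14 binary AU nodes and counts, for a single node, how many parent sets are admissible under a cap and how many free CPT parameters a node with that many binary parents needs. Acyclicity is ignored, so the first number is an upper bound per node; the function names are hypothetical.

```python
from math import comb

N_AUS = 14  # number of AU nodes studied in the paper


def candidate_parent_sets(max_parents, n_nodes=N_AUS):
    """Number of parent sets one node can choose from under the cap."""
    return sum(comb(n_nodes - 1, k) for k in range(max_parents + 1))


def free_parameters(n_parents):
    """Free CPT parameters of one binary node with n_parents binary parents."""
    return 2 ** n_parents


for cap in range(2, 7):
    print(cap, candidate_parent_sets(cap), free_parameters(cap))
```

Both quantities grow quickly with the cap, which is consistent with the observation that larger parent limits increase structure complexity and search time.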

Bayesian Network Inference and Classification Performance
To further analyze the five structures from the previous experiment, we also learn the parameters of the five structures in Fig. 4 from the two AU-coded databases. The two databases are divided into three sets, namely training data, testing data, and observed data, in the ratio 2:1:1. The observed data is used as evidence to perform BN inference on the testing data, so that the posterior probability of each AU is inferred from each structure. The inferred probability decides the presence or absence of the AU in the image, and AU classification is performed accordingly. The average recognition rate for each AU is shown in Fig. 5(a), where the average recognition rate is defined as the percentage of examples recognized correctly. The overlapping markers in the figure show that the average recognition rate of some AUs changes little across structures, which indicates that the complexity of the structure has no significant effect on the recognition of those AUs. Fig. 5(b) shows the overall recognition rate over all AUs for the learned structures; the recognition rate improves slightly with increasing structure complexity. To assess the classification results, ROC analysis is performed, and the overall averaged area under the curve (AUC) is presented in Fig. 6.
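The inference and classification step can be sketched as follows. This is a minimal illustration using exact inference by enumeration on a three-node chain with hypothetical nodes A, B, C and made-up CPT values; the 0.5 decision threshold and the use of other node labels as evidence are assumptions, since the paper does not specify them.

```python
from itertools import product


def joint_prob(assign, parents, cpt):
    """P(assignment) as the product of P(node | parents) over all nodes."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assign[q] for q in pa)]   # P(node = 1 | parent values)
        p *= p_one if assign[node] == 1 else 1.0 - p_one
    return p


def posterior(target, evidence, parents, cpt):
    """P(target = 1 | evidence), summing the joint over all unobserved nodes."""
    hidden = [n for n in parents if n != target and n not in evidence]
    mass = {0: 0.0, 1: 0.0}
    for t in (0, 1):
        for values in product((0, 1), repeat=len(hidden)):
            assign = {**evidence, **dict(zip(hidden, values)), target: t}
            mass[t] += joint_prob(assign, parents, cpt)
    return mass[1] / (mass[0] + mass[1])


def recognition_rate(samples, target, observed, parents, cpt, threshold=0.5):
    """Fraction of samples whose target label matches the thresholded posterior."""
    correct = 0
    for row in samples:
        evidence = {n: row[n] for n in observed}
        predicted = int(posterior(target, evidence, parents, cpt) >= threshold)
        correct += predicted == row[target]
    return correct / len(samples)


# Hypothetical chain A -> B -> C; each CPT entry is P(node = 1 | parent values).
parents = {"A": (), "B": ("A",), "C": ("B",)}
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.2, (1,): 0.7},
}
samples = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 0, "B": 0, "C": 0},
    {"A": 1, "B": 0, "C": 1},
    {"A": 0, "B": 0, "C": 1},
]
rate = recognition_rate(samples, "B", ("A",), parents, cpt)
print(posterior("B", {"A": 1}, parents, cpt), rate)
```

Enumeration is exponential in the number of unobserved nodes, so it is only practical for small networks; it is used here because it makes the definition of the posterior explicit.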

Conclusions
In this work, the relationships among AUs are studied and learned using a BN to obtain an optimized structure that describes them. A prior structure constructed from the two databases gives better results than a randomly constructed structure. It provides a rough picture of how AUs are related to each other, which is further confirmed by learning the structure. Considering the complexity and execution time of the learned structure, constraints on the number of parents for each node are applied. BN inference and classification on the data sets are performed to compare 5 structures of increasing network complexity. The structure scores and classification results show that the recognition rate increases slightly with the complexity of the 5 structures. However, the difference is small, whereas the execution time can double between structure 1 and structure 3. Therefore, structure 1, which allows at most the best 2 parents for each node, is favored. An optimized AU structure that represents the relationships among the 14 AUs is important for further facial expression applications.