Pet dog facial expression recognition based on convolutional neural network and improved whale optimization algorithm

In this section, we present a series of experimental results that demonstrate the advantages of the IWOA, comprising a benchmark function test and a facial expression recognition experiment. In the benchmark function test, the basic WOA, the IWOA and other relevant intelligent optimization algorithms are used to search for the optimal solutions of several benchmark functions. In the facial expression recognition experiment, we compare a variety of classifiers: Support Vector Machine (SVM), LeNet-5 [50], an unoptimized CNN, a CNN optimized by the basic WOA (WOA-CNN) and a CNN optimized by the IWOA (IWOA-CNN). They are applied not only to dog expression recognition but also to human expression recognition on several existing datasets.
Benchmark functions test
We compare five intelligent optimization algorithms in this test: PSO [51], GWO [52], SSA [53], the basic WOA and the IWOA. They are used to search for the optimal solutions of eight distinct benchmark functions; the unimodal functions are listed in Table 1 and the multimodal functions in Table 2. To ensure a fair comparison, the following parameters are kept identical for all algorithms: the dimension of each function is 30, the maximum number of iterations is 500, and the population size is 100. All algorithms are implemented in Python, and the experimental platform is a PC running Windows 10 with an Intel Core i5 CPU @ 2.60 GHz, a GP107 GPU and 16 GB of memory.
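As a rough illustration of the test setup above, the following sketch runs the basic WOA (not the IWOA) on a single benchmark function and records its convergence curve. It is a minimal sketch under stated assumptions: the sphere function and the search bounds of [-100, 100] are stand-ins because Tables 1 and 2 are not reproduced here, the spiral constant `b = 1` is a common default, and the function names are hypothetical rather than the authors' code.

```python
import numpy as np


def sphere(x):
    """Sphere function: an assumed unimodal benchmark, minimum 0 at the origin."""
    return float(np.sum(x ** 2))


def basic_woa(fitness, dim=30, pop_size=100, max_iter=500,
              lb=-100.0, ub=100.0, b=1.0, seed=0):
    """Basic whale optimization algorithm with a linear convergence factor."""
    rng = np.random.default_rng(seed)
    whales = rng.uniform(lb, ub, size=(pop_size, dim))
    scores = np.array([fitness(w) for w in whales])
    best_idx = int(np.argmin(scores))
    best, best_score = whales[best_idx].copy(), float(scores[best_idx])
    curve = []  # best fitness found so far, one entry per iteration

    for t in range(max_iter):
        a = 2.0 * (1.0 - t / max_iter)  # decreases linearly from 2 to 0
        for i in range(pop_size):
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # Encircle the best whale when |A| < 1, otherwise explore
                # around a randomly chosen whale.
                if abs(A) < 1.0:
                    target = best
                else:
                    target = whales[rng.integers(pop_size)]
                new_pos = target - A * np.abs(C * target - whales[i])
            else:
                # Spiral (bubble-net) move toward the best whale.
                l = rng.uniform(-1.0, 1.0)
                dist = np.abs(best - whales[i])
                new_pos = dist * np.exp(b * l) * np.cos(2.0 * np.pi * l) + best
            whales[i] = np.clip(new_pos, lb, ub)
            score = fitness(whales[i])
            # Simplification: the best solution is updated immediately.
            if score < best_score:
                best, best_score = whales[i].copy(), score
        curve.append(best_score)
    return best, best_score, curve


if __name__ == "__main__":
    # Settings from the text: dimension 30, 500 iterations, population 100.
    _, final_score, history = basic_woa(sphere)
    print("final fitness:", final_score)
```

Plotting `history` on a logarithmic axis gives a convergence curve of the kind compared across algorithms in this test.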
To visualize the performance of these algorithms, convergence curves are used to describe the search for the optimal solution of each function. Fig. 11 compares the search processes of the five algorithms. It shows that IWOA converges fastest and also reaches the best final results among the five algorithms. On F5 in particular, its final fitness value is significantly better than those of the other algorithms. The combined action of the nonlinear convergence factor and the adaptive weight accelerates the convergence of IWOA toward the global optimum. The differential mutation strategy applied to the population increases population diversity and helps the algorithm escape local optima in time, which is clearly visible in the convergence curve of F4. It is worth noting that although GWO finds better solutions than WOA in many cases, WOA takes less time and has fewer parameters than GWO. Both algorithms were proposed by Mirjalili et al., with WOA being the later of the two. Hence, we consider WOA the more worthwhile algorithm to study. In short, this experiment indicates that IWOA outperforms the compared algorithms in searching for the optimal solutions of these benchmark functions, which makes it a strong candidate for solving optimization problems.

The convergence curves of benchmark functions.

The architecture of LeNet-5 network.
Facial expression recognition experiment
We collected 315 images of pet dogs. After the image pre-processing described in the “Image pre-processing” section, the images are cropped to 48 × 48 pixels, yielding a dog facial expression dataset of 3150 images classified into five expressions (normal, happy, sad, angry and fear). The dataset is divided into a training set and a validation set, with the validation set accounting for 20%. Different classifiers are then applied to these expression images: SVM, LeNet-5, CNN, WOA-CNN and IWOA-CNN. Their parameter settings and architectures are described in Table 3.
Among these classifiers, SVM is not a neural network model and cannot extract image features by itself. The histogram of oriented gradients (HOG) [54] is therefore used to extract the image features, which the SVM then classifies. We refer to this image classification method as HOG–SVM. The network models among the above classifiers use the categorical cross-entropy function as the loss function, whereas SVM uses the hinge loss function to assess the classification of all categories. Each network model is trained on the image data for 200 epochs, by which point the recognition accuracy and loss tend to converge. We take the accuracy, loss and confusion matrix of expression recognition as the evaluation metrics of the experimental results. The accuracy is reported for both the training set and the validation set; it is also called the recognition rate and is calculated by the following formula:
$$Recognition\,Rate = \frac{TP}{TP + FN}$$
(18)
where TP and FN indicate the numbers of true positive and false negative cases in the evaluation results, respectively. The loss represents the error between the predicted value of a sample and its true value; generally, the smaller the loss, the higher the accuracy. The accuracy of all classifiers in dog facial expression recognition is shown in Fig. 13. Because the loss function used by SVM differs from that of the other classifiers, only the training losses of the network models are presented in Fig. 14.
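The recognition rate in Eq. (18) can be read directly from a confusion matrix. The sketch below is illustrative only: it assumes rows are true classes and columns are predicted classes, the function names are hypothetical, and the 3-class matrix is invented for the example rather than taken from the experimental results.

```python
import numpy as np


def recognition_rates(confusion):
    """Per-class recognition rate TP / (TP + FN).

    Assumes rows are true classes and columns are predicted classes.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_positive = np.diag(confusion)
    # A row sum counts every sample that truly belongs to the class: TP + FN.
    actual = confusion.sum(axis=1)
    return np.divide(true_positive, actual,
                     out=np.zeros_like(actual), where=actual > 0)


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    confusion = np.asarray(confusion, dtype=float)
    return float(np.trace(confusion) / confusion.sum())


# Hypothetical 3-class confusion matrix, not taken from the paper's figures.
example = [[45, 3, 2],
           [4, 40, 6],
           [1, 9, 40]]
rates = recognition_rates(example)
accuracy = overall_accuracy(example)
```

Here the first class has 45 true positives and 5 false negatives, so its recognition rate is 45 / 50 = 0.9.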

The accuracy of all classifiers.

The training losses of all network models.
In terms of recognition accuracy, the network models achieve higher accuracy than the SVM, and the CNN model performs better than the other unoptimized methods in this experiment. Optimizing the parameters of this model with the basic WOA improves its recognition accuracy only slightly, because the optimization ability of WOA is limited. IWOA, however, improves the recognition accuracy of the original model by more than 3 percentage points, which indicates that IWOA can effectively help the model obtain better operating parameters.
The confusion matrices of these network models in dog facial expression recognition are shown in Fig. 15a–d, which reveal the details of the sample classification: the normal and happy categories are discriminated with a higher recognition rate, while the recognition accuracy of the sad and fear categories does not exceed 90%. This might be improved by using a deeper network model.

The confusion matrix results of all network models in dog facial expression recognition (generated using Python’s seaborn library). (a) The confusion matrix of the LeNet-5. (b) The confusion matrix of the CNN. (c) The confusion matrix of the WOA-CNN. (d) The confusion matrix of the IWOA-CNN.
To observe the model training process in detail, the accuracy and loss curves of all network models during training are presented in Fig. 16a–h. These curves show that LeNet-5, whose architecture is shown in Fig. 12, attains the lowest accuracy of these network models, owing to its insufficient number of convolutional layers and its lack of a dropout layer. Moreover, its loss curve is also the most unstable (it fails to converge), which indicates that the image features it learns are relatively superficial. The CNN model achieves a higher accuracy and its loss curve is more stable, especially after its parameters are optimized by the WOA. Nevertheless, the accuracy curve of WOA-CNN does not converge significantly faster than that of CNN, because the parameters optimized by the WOA are still not good enough. By contrast, the accuracy and loss curves of IWOA-CNN converge faster than those of the other models, and IWOA-CNN also achieves the highest accuracy and the lowest loss.

The accuracy curve and loss curve of model training in dog facial expression recognition. (a) The accuracy curve of the LeNet-5. (b) The loss curve of the LeNet-5. (c) The accuracy curve of the CNN. (d) The loss curve of the CNN. (e) The accuracy curve of the WOA-CNN. (f) The loss curve of the WOA-CNN. (g) The accuracy curve of the IWOA-CNN. (h) The loss curve of the IWOA-CNN.
In terms of runtime efficiency, HOG–SVM extracts image features before model training, whereas a network model performs feature extraction and recognition within the training process itself, so the two are not directly comparable. We therefore compare only the single training duration of each network model in this experiment. The results are presented in Fig. 17, which shows that LeNet-5 takes the shortest time of these network models because it has the simplest architecture, while the three CNN-based models take longer. Of the three, IWOA–CNN takes the longest, WOA–CNN comes second, and CNN takes the shortest time. This is mainly caused by differences in the number of effective neurons and in the learning rates at runtime. After optimization, the keep probability of the dropout layer is about 0.73 in WOA–CNN and about 0.77 in IWOA–CNN, both higher than that in CNN. Therefore, WOA–CNN and IWOA–CNN need to compute more neurons than CNN at runtime. Because exponential decay is applied to the learning rate, the learning rates of the three CNN-based models gradually decrease during training. The change in their learning rates is shown in Fig. 18, from which we can see that, in order to achieve higher recognition accuracy, the learning rates of WOA–CNN and IWOA–CNN are lower than that of CNN throughout training. The initial learning rate and decay rate of IWOA–CNN are higher than those of WOA–CNN. On the whole, IWOA–CNN has the lowest learning rate. More neuron computations and a smaller learning rate lead to an increase in training time, which accounts for the differences in training time among the network models shown in Fig. 17. Although the optimized models take longer to train, their improvement in recognition accuracy is also clear. All things considered, IWOA–CNN has the best performance among all classifiers in this experiment.
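The two quantities discussed above, the exponentially decayed learning rate and the number of neurons a dropout layer keeps active, can be sketched as follows. This is an illustration under assumptions: the decay is taken to have the common form `initial_lr * decay_rate ** (step / decay_steps)`, in which `decay_rate` is a multiplicative factor, and the learning rate settings and the layer width of 512 are hypothetical. Only the keep probabilities 0.73 and 0.77 come from the text.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` updates under exponential decay (assumed form)."""
    return initial_lr * decay_rate ** (step / decay_steps)


def expected_active_units(n_units, keep_prob):
    """Expected number of units a dropout layer keeps active per training step."""
    return n_units * keep_prob


# Hypothetical schedule: initial rate 0.01, multiplied by 0.96 every 100 steps.
schedule = [exponential_decay(0.01, 0.96, 100, s) for s in range(0, 501, 100)]

# Keep probabilities reported in the text, applied to a hypothetical 512-unit layer.
active_woa_cnn = expected_active_units(512, 0.73)
active_iwoa_cnn = expected_active_units(512, 0.77)
```

A higher keep probability leaves more units active per step, which is consistent with the longer training times reported for the optimized models.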

Single training duration of all network models.

Learning rates of the CNN-based models.
In addition, we applied these classifiers to human expression recognition on several existing datasets: Japanese Female Facial Expressions (JAFFE) [55], CK+ [56] and the Oulu-CASIA NIR&VIS facial expression database (Oulu-CASIA) [57]. JAFFE is a dataset of Japanese women that covers 7 facial expressions with 213 images at a resolution of 256 × 256 pixels. The CK+ dataset contains 8 categories of expressions with 593 images at a resolution of 640 × 490 pixels. Oulu-CASIA contains 2880 image sequences with 6 different facial expressions under 6 different lighting conditions. We select the last frame of every image sequence of the 80 subjects under the strong visible light condition (each subject corresponds to 6 different expression image sequences), giving 480 images in total for the experiment. A summary of these datasets is given in Table 4. All three datasets were collected in a laboratory environment, but they differ in recognition difficulty. CK+ and JAFFE have better image quality, and many studies report relatively high recognition accuracy on them. Moreover, CK+ has more samples than JAFFE, so it is less difficult to recognize. Because of the lighting, many expression images in Oulu-CASIA are not very clear, which makes it the most difficult of the three datasets to recognize.
To ensure the reliability of the experimental results, we crop the facial region from each image and resize it to 48 × 48 pixels. Data augmentation is then used to balance the number of samples in each category and to build virtual samples that expand the total number of samples. For each dataset, 80% of the data is used for training and the remaining 20% for validation. Each classifier is trained on each dataset 10 times, and the results are averaged to give the final score. The recognition accuracy of all classifiers and the training losses of all network models on these datasets are presented in Fig. 19a–f.
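The evaluation protocol above (an 80/20 split, 10 training runs, averaged scores) can be sketched as follows. Several points are assumptions made for illustration: the text does not say whether the split is redrawn for each run, so this sketch reshuffles every time; the nearest-centroid classifier is only a lightweight stand-in for the classifiers compared in the paper; and the synthetic features replace real 48 × 48 face images. All function names are hypothetical.

```python
import numpy as np


def holdout_split(n_samples, val_fraction, rng):
    """Shuffle sample indices and split them into training and validation parts."""
    order = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    return order[n_val:], order[:n_val]


def repeated_holdout(features, labels, train_and_score,
                     runs=10, val_fraction=0.2, seed=0):
    """Average the validation score over several independent training runs."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(runs):
        train_idx, val_idx = holdout_split(len(labels), val_fraction, rng)
        scores.append(train_and_score(features[train_idx], labels[train_idx],
                                      features[val_idx], labels[val_idx]))
    return float(np.mean(scores)), scores


def nearest_centroid_score(x_train, y_train, x_val, y_val):
    """Stand-in classifier: assign each validation sample to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(x_val[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[np.argmin(dists, axis=1)]
    return float(np.mean(predictions == y_val))


# Synthetic, well-separated data: 3 classes with 40 samples each.
data_rng = np.random.default_rng(1)
labels = np.repeat(np.arange(3), 40)
features = data_rng.normal(size=(120, 8)) + 3.0 * labels[:, None]
mean_score, run_scores = repeated_holdout(features, labels, nearest_centroid_score)
```

Averaging over several runs reduces the influence of any single favourable or unfavourable split on the reported score.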

The recognition accuracy and losses on different expression datasets. (a) Recognition accuracy on JAFFE. (b) Losses on JAFFE. (c) Recognition accuracy on CK+. (d) Losses on CK+. (e) Recognition accuracy on Oulu-CASIA. (f) Losses on Oulu-CASIA.
These experimental results indicate that the recognition accuracy for human facial expressions is higher than that for dogs in most cases, because there are many dog breeds and the facial differences between breeds are quite large. The recognition accuracy on the CK+ dataset is the highest, because its image quality is the best and the differences between expressions are obvious. Owing to the influence of lighting, the recognition rate on the Oulu-CASIA dataset is comparatively low. In terms of individual classifiers, the recognition accuracy of the CNN model is higher than that of SVM. All network models are trained on each dataset for 200 epochs, which may be insufficient for some datasets. Applying the WOA to optimize the parameters of the original CNN model increases the recognition rate to a certain extent. Thanks to the strong optimization ability of IWOA, IWOA–CNN attains the highest accuracy and the lowest loss. By contrast, the recognition accuracy of LeNet-5 is relatively low because of its insufficient network depth, and its lack of measures to prevent overfitting leads to a large gap between the training set and the validation set in both accuracy and loss. To summarize, IWOA can greatly improve the performance of the CNN model and achieve strong results in facial expression recognition.
Informed consent statement
All images of pet dogs in this study are used with the permission of the dog’s owner, if the dog has an owner.