An emotion recognition system: bridging the gap in human-machine interaction

Ahmed Nouman


INTRODUCTION
Human emotion recognition is a rapidly growing field that has gained significant attention in recent years due to its potential applications in various areas, including psychology, healthcare, education, and entertainment. Emotions are complex and subjective experiences crucial to communication, decision-making, and well-being. Understanding emotions is essential for effective human-robot interaction, personalized mental health interventions, and many other applications.
Recent technological advances have made it possible to detect and analyze human emotions using various techniques. Researchers have explored different approaches to recognizing and classifying emotions accurately, including machine learning algorithms, such as support vector machines, artificial neural networks, and deep learning models. Other techniques involve analyzing physiological signals, such as electroencephalography (EEG), electrocardiography (ECG), and galvanic skin response (GSR), to extract features related to emotional responses.
Emotion recognition systems have numerous potential applications. For example, emotion recognition can be used to monitor and improve mental health conditions such as depression, anxiety, and post-traumatic stress disorder (PTSD). In entertainment, it can personalize video and audio content to the viewer's emotional state. In security, it can help detect and prevent crimes by identifying suspicious behavior and emotional states.
Emotion recognition refers to identifying and analyzing human emotions from multiple sources, such as facial expressions, physiological signals, speech, and gestures. In this literature review, we discuss recent advances in emotion recognition across these modalities and the methods proposed by various researchers.

LITERATURE REVIEW
This section covers various state-of-the-art approaches to emotion recognition. We discuss four areas: facial, speech, hybrid, and physiological signal-based methods. Numerous studies emphasize deep learning approaches and multimodal systems for improved accuracy. The surveyed work spans applications such as human-computer interaction, assessment of real-world recognition deficits, and physiological-signal analysis.

Facial emotion recognition
Jain et al. [1] classified each image into one of six facial emotion classes. Balasubramanian et al. [2] covered the datasets and algorithms used for facial emotion recognition (FER). The algorithms used were Gabor filters [3], the histogram of oriented gradients (HoG) [4], and local binary patterns (LBP) [5] for feature extraction [6]. Hassouneh et al. [7] aimed to classify the emotional expressions of physically disabled people (deaf, mute, and bedridden) and children with autism based on facial landmarks and electroencephalograph (EEG) signals, using convolutional neural network (CNN) and long short-term memory (LSTM) classifiers. They established an algorithm for real-time emotion recognition using virtual markers through an optical flow algorithm that works effectively under uneven lighting, head rotation of up to 25°, multiple backgrounds, and various skin tones. Mellouk et al. [8] studied recent work on automatic FER via deep learning. Deep learning techniques for human-computer interaction were employed in [9], building on advances in artificial intelligence as an efficient system application procedure. Hayes et al. [10] studied how variability in the age effects of different facial emotion recognition task designs affects our understanding of real-world deficits and task selection in upcoming emotion recognition studies. Ulusoy et al. [11] examined the capacity of patients with bipolar disorder (BD), their parents, and healthy controls to recognize and distinguish between facial emotions.

Speech emotion recognition
Zhang et al. [12] presented a novel attention-based fully convolutional network for speech emotion recognition. Albanie et al. [13] considered learning embeddings for speech classification without access to labeled audio. Deep learning techniques are utilized as an alternative to traditional approaches in speech emotion recognition, and Khalil et al. [14] discussed recent literature where these methods are used for speech-based emotion recognition. At the emotion classification stage, an algorithm was proposed to determine the structure of the decision tree [15]. By utilizing the ability of CNNs to model contextual information, Latif et al. [16] showed that there is still potential to enhance the performance of emotion recognition from raw speech. Both verbal and nonverbal sounds within an utterance were considered for emotion recognition in real-life conversations [17]. To create more accurate multimodal feature representations, Xu et al. [18] suggested using an attention mechanism to learn the alignment between speech frames and text words. Koduru et al. [19] contributed to improving a system's speech emotion recognition rate using different feature extraction algorithms. Siriwardhana et al. [20] explored the use of modality-specific "BERT-like" pre-trained self-supervised learning (SSL) architectures to represent both speech and text modalities for multimodal speech emotion recognition. Pepino et al. [21] proposed a transfer learning method for speech emotion recognition in which features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks.

Hybrid approaches for emotion recognition
Albraikan et al. [22] presented a hybrid sensor fusion approach based on a stacking model that allows information from various sensors and emotion models to be jointly incorporated within a user-independent model. Pane et al. [23] proposed strategies incorporating emotion lateralization and an ensemble learning approach to enhance the accuracy of EEG-based emotion recognition. Alswaidan et al. [24] critically surveyed the state-of-the-art research on explicit and implicit emotion recognition in text; they discussed the different approaches in the literature, detailed their main features, advantages, and limitations, and compared them in tables. During human-computer interaction, it can be difficult to recognize facial emotions automatically. Sandhu et al. [25] used a hybrid CNN approach to recognize human emotions and categorize them into subcategories based on their features. A hybrid system consisting of three feature extraction stages, dimensionality reduction, and feature classification was proposed for speech emotion recognition (SER) [26], [27].
Moreover, a novel emotion recognition system based on a number of modalities, including electroencephalogram (EEG), galvanic skin response (GSR), and facial expressions, was introduced [28]. Siddiqui et al. [29] presented a multimodal automatic emotion recognition (AER) framework capable of accurately differentiating expressed emotions. To predict emotions by examining facial expressions in an image, a convolutional neural network (CNN)-based deep learning method has been proposed [30].

Physiological signal-based emotion recognition
Shu et al. [31] gave an in-depth analysis of physiological signal-based emotion recognition that covered emotion models, emotion elicitation techniques, published emotional and physiological datasets, features, classifiers, and the entire framework for emotion recognition based on physiological signals. Li et al. [32] presented an extensive and organized taxonomy for recognizing emotions based on physiological signals.
Emotion recognition methods based on multi-channel EEG and multimodal physiological signals are reviewed in [33]. Kim et al. [34] presented a robust physiological model called the deep physiological affect network (DPAN) for recognizing human emotions. Li et al. [35], [36] proposed a multimodal attention-based BLSTM network framework for efficient emotion recognition. The work in [37] attempts to fuse subjects' individual EDA features with external evoked music features. Yin et al. [38] proposed an end-to-end multimodal framework, the one-dimensional residual temporal and channel attention network (RTCAN-1D). Chen et al. [39] proposed a single SP-signal-based method for emotion recognition. Stappen et al. [40] offered four different sub-challenges: i) MuSe-Wilder and ii) MuSe-Stress, which concentrate on continuous emotion (valence and arousal) prediction; iii) MuSe-Sent, which requires participants to identify five classes each for valence and arousal; and iv) MuSe-Physio, which asks participants to predict a novel aspect of "physiological emotion." For that year's challenge, the Ulm-TSST dataset, which shows people in stressful dispositions, was introduced [40]. To fill a gap in the literature, Ahmad et al. [41] reviewed the impact of inter-subject data variance on emotion recognition, essential data annotation techniques for emotion recognition and their comparison, data pre-processing methods for each physiological signal, data splitting techniques to enhance the generalization of emotion recognition models, and multiple multimodal fusion methods and their comparison.

RESEARCH METHOD
The complete methodology of the emotion detection framework is shown in Figure 1. First, we capture the image and crop it to the processing size. We then convert it from RGB to grayscale and apply histogram equalization. Next, Canny edge detection and the Hough circle transform are performed to locate the person's eyes. Finally, we identify the critical points in the image, which serve as image descriptors for classification. We explain the process in detail in this section, and the working of the algorithm shown in Figure 2 is also elaborated here.

Image capturing
We use a high-end webcam, the Logitech Brio 4K, designed for professional use. It can capture video in 4K Ultra HD at 30 frames per second, or in 1080p or 720p at up to 60 frames per second. It offers advanced features such as autofocus and 5× digital zoom, and supports high dynamic range (HDR) imaging for improved color and contrast in difficult lighting conditions.

Conversion to grayscale image
We convert the captured image in the red, green, and blue (RGB) color space to a single-channel grayscale image. Grayscale images are often used in computer vision tasks, such as object recognition, image segmentation, and edge detection, as they reduce the complexity of the image by eliminating color information while preserving its overall structure and contrast. In a grayscale image, each pixel is represented by a single channel, with values ranging from 0 to 255, where 0 corresponds to black and 255 to white.
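As an illustration, the conversion can be written in a few lines of NumPy. The ITU-R BT.601 luminance weights used here are a common convention; the paper does not specify which weighting its implementation uses, so this is a sketch under that assumption.

```python
import numpy as np

def rgb_to_grayscale(image: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image (uint8) to a single-channel
    grayscale image using the ITU-R BT.601 luminance weights."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = image.astype(np.float64) @ weights
    return np.clip(np.round(gray), 0, 255).astype(np.uint8)
```

A pure-white pixel maps to 255 and pure black to 0, since the three weights sum to one.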

Histogram equalization
Next, we apply histogram equalization to the grayscale image. Histogram equalization is a contrast enhancement method in digital image processing that redistributes pixel intensities to improve overall image contrast. In other words, it adjusts the dynamic range of an image by spreading the intensity levels over the whole range. This is done by calculating a histogram of pixel intensities in the image and then modifying the pixel values so that the histogram becomes more evenly distributed. It benefits images with a very narrow or compressed range of pixel intensities, which would otherwise appear flat or low in contrast. The result is an image with higher contrast and better visibility of details.
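A minimal NumPy sketch of the standard CDF-based mapping described above; library routines such as OpenCV's `cv2.equalizeHist` implement essentially the same mapping.

```python
import numpy as np

def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Classic histogram equalization for an 8-bit grayscale image:
    remap each intensity through the normalized cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-zero CDF value
    # Scale the CDF so the occupied intensity range stretches to [0, 255].
    lut = np.round((cdf - cdf_min) / max(gray.size - cdf_min, 1) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]
```

Applied to a low-contrast image whose pixels all fall in a narrow band, the darkest occupied intensity is pushed to 0 and the brightest to 255.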

Edge detection
Edge detection is a standard image processing technique used to identify and highlight the edges or boundaries within an image. The edges in an image represent areas of rapid change in brightness or intensity, such as the boundaries between objects or the contours of shapes. Edge detection works by analyzing the intensity differences between adjacent pixels and identifying areas where there is a sharp change in intensity.
There are several algorithms for edge detection; some of the most common are the Sobel [42], Canny [43], and Roberts [44] operators. The Sobel operator calculates the image intensity gradient in the horizontal and vertical directions. In contrast, the Canny operator uses a multi-stage algorithm that includes smoothing, edge detection, and hysteresis thresholding to produce high-quality edge detection results. The Roberts operator is a simple but effective operator that calculates the gradient using a pair of 2×2 kernels. We apply the Canny edge detection technique to extract features from our images.
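The gradient computation at the heart of the Sobel operator, which is also the first stage of Canny, can be sketched in NumPy as follows. In practice the full Canny pipeline would come from a library such as OpenCV (`cv2.Canny`), so this block is illustrative only.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def convolve2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2-D correlation; adequate for a 3x3 kernel demo."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sobel_magnitude(gray: np.ndarray) -> np.ndarray:
    """Gradient magnitude: the edge-strength map used by Sobel-style
    edge detectors and by the first stage of Canny."""
    g = gray.astype(np.float64)
    gx = convolve2d(g, SOBEL_X)
    gy = convolve2d(g, SOBEL_Y)
    return np.hypot(gx, gy)
```

On a vertical step edge the response peaks at the intensity discontinuity and vanishes in flat regions, which is exactly the "sharp change in intensity" criterion described above.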

Hough circle transform
The Hough circle transform [45], [46] is a feature extraction technique used in digital image processing to detect circular shapes in images. It is an extension of the Hough transform algorithm, which detects straight lines in images. The Hough circle transform maps the image into a parameter space in which each circle is represented as a single point, with the radius and center coordinates of the circle encoded in the coordinates of that point. We use the Hough circle transform to detect the subject's irises in the binary images produced by Canny edge detection.
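The voting idea behind the transform can be sketched for a single known radius: every edge pixel votes for all centers that could have produced it, and the accumulator peak marks the detected circle. A production system would typically use OpenCV's `cv2.HoughCircles`, which also searches over radii; this is a didactic sketch.

```python
import numpy as np

def hough_circle_center(edges: np.ndarray, radius: int, n_angles: int = 360):
    """Minimal Hough circle transform for one known radius: return the
    (row, col) accumulator peak, i.e. the most likely circle center."""
    h, w = edges.shape
    acc = np.zeros((h, w), dtype=np.int32)
    thetas = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    ys, xs = np.nonzero(edges)
    for y, x in zip(ys, xs):
        # Candidate centers lie on a circle of the same radius around (y, x).
        cy = np.round(y - radius * np.sin(thetas)).astype(int)
        cx = np.round(x - radius * np.cos(thetas)).astype(int)
        ok = (cy >= 0) & (cy < h) & (cx >= 0) & (cx < w)
        np.add.at(acc, (cy[ok], cx[ok]), 1)
    return np.unravel_index(np.argmax(acc), acc.shape)
```

Given a synthetic ring of edge pixels, the accumulator peak recovers the true center to within a pixel of rounding error.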

Image registration
Image alignment, also known as image registration, is the process of aligning multiple images of the same scene or object. The goal is to find a transformation that maps one image onto another so that they share the same coordinate system. Image alignment is important in various applications, including computer vision, remote sensing, medical imaging, and astronomy. We align the eyes in our images using this image registration step.
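With only the two detected eye centers as correspondences, a similarity transform (rotation, uniform scale, and translation) is the natural alignment model. The sketch below, using the complex-number form z' = a·z + b, is an illustrative assumption rather than the paper's exact registration procedure.

```python
def eye_alignment_transform(eyes_src, eyes_dst):
    """Similarity transform mapping the two detected eye centers onto
    canonical positions; points are treated as complex numbers z' = a*z + b."""
    p1, p2 = (complex(*p) for p in eyes_src)
    q1, q2 = (complex(*p) for p in eyes_dst)
    a = (q2 - q1) / (p2 - p1)  # encodes rotation and uniform scale
    b = q1 - a * p1            # translation
    return a, b

def apply_transform(a, b, point):
    """Apply the similarity transform to one (x, y) point."""
    z = a * complex(*point) + b
    return (z.real, z.imag)
```

By construction, both source eye centers land exactly on their canonical targets, and every other facial point is carried along by the same transform.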

Keypoints detection and image descriptors
Keypoint detection, also known as interest point detection, is a technique in computer vision that identifies and localizes distinctive features or points in an image. These key points are regions with certain properties, such as high contrast, sharp edges, or corners, that make them easily distinguishable from the surrounding areas. The process of keypoint detection typically involves analyzing an image using a series of algorithms to identify areas likely to be key points. Some popular algorithms for keypoint detection include Harris corner detection [47], the scale-invariant feature transform (SIFT) [48], speeded-up robust features (SURF) [49], and oriented FAST and rotated BRIEF (ORB) [50].
For our methodology, we use a manual keypoint detection technique to classify human emotions. We locate the bottom of the forehead by joining a line between the eyes in the image and finding its center. We then calculate the distances from this point to keypoint locations such as the lips, cheeks, ears, chin, and forehead. In total, we compute 28 such distances, which serve as image descriptors for the classification methods.
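The resulting descriptor is simply a vector of Euclidean distances from the reference point to each landmark. A minimal sketch (the landmark coordinates passed in are hypothetical placeholders, not the paper's actual points):

```python
import numpy as np

def distance_descriptor(reference, landmarks):
    """Build the feature vector: Euclidean distances from a reference
    point (bottom of the forehead) to each facial landmark."""
    ref = np.asarray(reference, dtype=np.float64)
    pts = np.asarray(landmarks, dtype=np.float64)
    return np.linalg.norm(pts - ref, axis=1)
```

With 28 landmark points this yields the 28-element descriptor fed to the classifier.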

Classification
Image classification is a task in computer vision that assigns a label or category to an input image. The aim is to teach a computer to recognize visual patterns in images and classify them into one of several pre-defined categories or classes. This is usually accomplished by training a machine learning model on a large dataset of labeled images, where the labels represent the correct category or class of each image.
There are several techniques used for image classification, including traditional machine learning algorithms such as support vector machines (SVM) [51], k-nearest neighbors (KNN) [52], [53], mixture of experts (MoE) [54], and AdaBoost [55], as well as deep learning methods such as convolutional neural networks (CNN) [56]. Deep learning has become the dominant approach for image classification in recent years due to its ability to learn complex features directly from raw image data. We constructed the dataset manually, with 1800 annotated images of people expressing six different emotions. We recruited fifty candidates and captured six images per person for each of the six emotions. We then divided the dataset into 75% (1350 images) for training, while the remaining 25% (450 images) were used to evaluate the generalization of the ANN. The dataset is divided equally across emotions to remove any inherent class bias.
We use an artificial neural network (ANN) with 28 input neurons, 56 hidden neurons in the second layer, and six neurons in the output layer. The six output neurons classify the human emotion in an image as fear, anger, sadness, happiness, excitement, or normal. The learning algorithm used is the Levenberg-Marquardt backpropagation algorithm [57]-[60]. The network converges to the goal at around 252 epochs, as shown in Figure 3, and the output of the ANN, in this case normal, is demonstrated in Figure 4.
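The paper trains this network with Levenberg-Marquardt backpropagation, commonly provided by numerical toolboxes. The sketch below shows only the 28-56-6 forward pass with randomly initialized weights, as an illustration of the architecture rather than the trained model; the tanh hidden activation and softmax output are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 28 distance features -> 56 hidden units -> 6 emotion classes
W1 = rng.standard_normal((56, 28)) * 0.1
b1 = np.zeros(56)
W2 = rng.standard_normal((6, 56)) * 0.1
b2 = np.zeros(6)

EMOTIONS = ["fear", "anger", "sadness", "happiness", "excitement", "normal"]

def forward(x: np.ndarray) -> np.ndarray:
    """Forward pass of the 28-56-6 network; the softmax output assigns
    a probability to each of the six emotion classes."""
    h = np.tanh(W1 @ x + b1)
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

def predict(x: np.ndarray) -> str:
    return EMOTIONS[int(np.argmax(forward(x)))]
```

Feeding in the 28-element distance descriptor yields a six-way probability vector whose argmax is the predicted emotion label.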

RESULT AND DISCUSSION
The study's results reveal the effectiveness of the six-emotion classification algorithm in accurately classifying emotions using a machine learning framework. The algorithm's overall accuracy of 92.23% demonstrates its potential for practical applications in healthcare, marketing, education, and human-computer interaction. The successful classification of emotions using machine learning algorithms can enhance human-machine interactions, personalized user experiences, and informed decision-making processes in various fields.
The confusion matrix generated in the study offers a deeper understanding of the algorithm's performance, as shown in Figure 5. It highlights that the algorithm had difficulty classifying the "excited" and "afraid" emotions, with 14 and 8 false classifications, respectively. A closer analysis of the confusion matrix indicates that most false classifications occurred between "excited" and "happy" and between "excited" and "afraid". The algorithm also exhibited confusion between "normal" and "happy". These misclassifications suggest that the algorithm might require further refinement to distinguish emotions with similar expressions or characteristics.

Another key finding of the study is the effectiveness of the manual distance calculation used as the image descriptor in the machine learning framework. It highlights the importance of selecting appropriate image descriptors for accurate emotion classification: the choice of descriptor can significantly impact the algorithm's performance and the overall accuracy of emotion classification.

While the results demonstrate the potential of machine learning algorithms for accurately classifying emotions, it is essential to acknowledge the limitations in classifying specific emotions, particularly "excited" and "afraid". Further research and development could focus on addressing these limitations. This might involve exploring alternative or additional image descriptors, refining the feature extraction process, or incorporating other machine learning techniques to improve classification accuracy.

CONCLUSION
The main conclusions highlight the potential and effectiveness of machine learning algorithms in accurately classifying emotions, an area of growing research interest with significant implications for fields such as healthcare, marketing, education, and human-computer interaction. The evaluated six-emotion classification algorithm achieved an overall accuracy of 92.23%, showcasing its potential for practical applications in these fields. However, the confusion matrix generated during the study revealed limitations in classifying specific emotions, particularly "excited" and "afraid": the majority of false classifications occurred between "excited" and "happy" and between "excited" and "afraid". This indicates the need for further refinement to improve the algorithm's ability to distinguish emotions with similar expressions or characteristics. The manual distance calculation used as the image descriptor proved effective, suggesting that the choice of image descriptor plays a significant role in determining the algorithm's performance and overall accuracy. Addressing the limitations in classifying specific emotions will enhance the algorithm's performance and expand its practical applications in fields that rely on accurate emotion recognition.

Figure 3. Training curve for the artificial neural network model, showing convergence to the goal at around 250 epochs
Figure 4. Output of the ANN for the normal class

Figure 5. Confusion matrix for emotion classification on the 450 test images of the dataset