Communication with computing machinery has become increasingly ‘chatty’ these days: Alexa, Cortana, Siri, and many more dialogue systems have hit the consumer market on a broader basis than ever, but do any of them truly notice our emotions and react to them like a human conversational partner would?
In fact, the discipline of automatically recognizing human emotion and affective states from speech, usually referred to as Speech Emotion Recognition or SER for short, has by now surpassed the “age of majority,” celebrating the 22nd anniversary of the seminal work of Dellaert in 1996, arguably the first research paper on the topic. However, the idea has existed even longer, as the first patent dates back to the late 1970s.
Very little serious research has gone into the mechanism behind the detection of emotional fluctuations from speech. The classical philosophical ideas are still valid; if they were not, people would not spend billions each year on detecting a person's emotional state from speech.
Most previous studies on speech emotion recognition have used pattern recognition methods applied to acoustic features extracted from audio files, such as pitch, energy, Mel-frequency filter banks, and Mel-frequency cepstral coefficients (MFCCs). Popular classifiers are the linear discriminant classifier (Roy D, Pentland A (1996)) and the k-nearest neighbor (k-NN) classifier (Averill J (1994)). In addition to linear discriminant classifiers, the Support Vector Machine (SVM) also achieves promising classification performance (You M, Chen C, Bu J, Liu J, Tao J (2006)). Non-linear discriminative classifiers such as the Artificial Neural Network (ANN) and decision trees are also employed because of their robust performance in certain cases. However, the same feature vector can produce entirely different classification results under different algorithms. The extraction and selection of features also play an important role in emotional modeling and in deciding what emotion should be perceived from a particular speech sample. Other important factors include the speaking/listening mechanism and linguistics. Recently, Deep Belief Network models have been applied to the emotional classification of audiovisual data, capturing complex non-linear feature interactions in multimodal data (audio and video features).
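The point that one feature vector can be labeled differently by different algorithms can be illustrated with a small sketch. Everything here is an assumption made for illustration: the two-dimensional feature values are invented toy numbers (not real pitch or energy measurements), the function names are hypothetical, and a nearest-centroid rule stands in as a simplified linear classifier rather than a full linear discriminant analysis.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=1):
    """Label a query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return int(np.bincount(nearest).argmax())

def nearest_centroid_predict(train_x, train_y, query):
    """Label a query by the closest class mean (a linear decision boundary)."""
    classes = np.unique(train_y)
    centroids = np.array([train_x[train_y == c].mean(axis=0) for c in classes])
    return int(classes[np.linalg.norm(centroids - query, axis=1).argmin()])

# Toy feature vectors (hypothetical values) for two emotion classes, 0 and 1.
features = np.array([[0, 0], [0, 2], [2, 0], [2, 2],
                     [3, 1], [6, 0], [6, 1], [6, 2]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A query lying close to one class-1 sample but nearer to the class-0 mean.
query = np.array([2.9, 1.0])
knn_label = knn_predict(features, labels, query, k=1)
centroid_label = nearest_centroid_predict(features, labels, query)
```

On this toy data the two rules disagree on the query: the 1-nearest-neighbor rule follows the single closest sample, while the centroid rule follows the overall class means.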
In view of the above, most speech emotion categorization techniques rely on stationary frequency-domain methods such as the Fourier power spectrum. These methods have been strongly questioned for their treatment of the non-stationary aspects of a signal. The spectrum of any signal covers a wide range of frequencies, and Fourier spectral analysis, which is based mostly on linear superpositions of trigonometric functions, leaves the numerous harmonics unattended. Additional harmonic components, common in most natural non-stationary time series (of which the speech signal is one), may produce a deformed wave profile. Those deformations are the well-known consequence of nonlinear contributions. These nonlinear and non-stationary aspects contribute to minute changes in the intricate dynamics of the speech signal, changes that might be caused by a cognitive impairment that restricts proper emotional expression, as in complex diseases like Alzheimer’s Disease (AD). Little work in this area has analyzed the non-stationary aspects of the speech signal.
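One way to see why a power spectrum alone can miss non-stationary structure is that it discards the order in which frequency content occurs. The sketch below is an illustration under stated assumptions, not a method from the studies cited here: the sampling rate and the two tone frequencies are arbitrary choices, and the signals are synthetic tones rather than speech.

```python
import numpy as np

def power_spectrum(x):
    """One-sided Fourier power spectrum of a real-valued signal."""
    return np.abs(np.fft.rfft(x)) ** 2

fs = 1000                       # assumed sampling rate in Hz
t = np.arange(fs // 2) / fs     # half a second of samples
low = np.sin(2 * np.pi * 50 * t)    # 50 Hz tone (arbitrary choice)
high = np.sin(2 * np.pi * 120 * t)  # 120 Hz tone (arbitrary choice)

# Signal whose frequency rises over time, and the same signal played backwards
# so that the high tone comes first.
rising = np.concatenate([low, high])
falling = rising[::-1]

# The waveforms differ sample by sample, yet their power spectra coincide,
# because time reversal only changes the phase of each Fourier coefficient.
same_spectrum = np.allclose(power_spectrum(rising), power_spectrum(falling), atol=1e-6)
same_waveform = np.allclose(rising, falling)
```

Here `same_spectrum` is true while `same_waveform` is false, so any feature computed only from the power spectrum cannot tell the two signals apart.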
In this product, a rigorous non-stationary methodology capable of categorizing speech signals of various emotions is applied. The Multifractal Detrended Fluctuation Analysis (MFDFA) method is used to analyze the internal dynamics of the acoustics of the digitized audio signal. Other non-linear methods, such as the visibility graph, that can assess the degree of multifractality accurately and reliably are also taken into account.
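The standard MFDFA procedure can be sketched in a few lines of NumPy. This is a minimal sketch of the textbook algorithm, not the implementation used in this product: the function names, the choice of scales and q values, the detrending order, and the use of forward-only windows (many implementations also window from the end of the series) are all assumptions. The spectrum width is computed with the commonly used relation tau(q) = q h(q) - 1, and a wider spectrum is usually read as stronger multifractality.

```python
import numpy as np

def mfdfa(signal, scales, q_values, order=1):
    """Return generalized Hurst exponents h(q) of a 1-D signal (MFDFA sketch)."""
    x = np.asarray(signal, dtype=float)
    profile = np.cumsum(x - x.mean())            # step 1: integrated profile
    log_fq = np.empty((len(q_values), len(scales)))
    for j, s in enumerate(scales):
        n_seg = len(profile) // s                # step 2: non-overlapping windows
        segments = profile[:n_seg * s].reshape(n_seg, s).T   # shape (s, n_seg)
        # step 3: least-squares polynomial trend fitted in every window at once
        design = np.vander(np.arange(s, dtype=float), order + 1)
        coeffs, *_ = np.linalg.lstsq(design, segments, rcond=None)
        f2 = np.mean((segments - design @ coeffs) ** 2, axis=0)
        # step 4: q-th order fluctuation function (q = 0 uses a log average)
        for i, q in enumerate(q_values):
            if q == 0:
                log_fq[i, j] = 0.5 * np.mean(np.log(f2))
            else:
                log_fq[i, j] = np.log(np.mean(f2 ** (q / 2.0))) / q
    # step 5: h(q) is the slope of log F_q(s) against log s
    log_s = np.log(np.asarray(scales, dtype=float))
    return np.array([np.polyfit(log_s, row, 1)[0] for row in log_fq])

def spectrum_width(q_values, h):
    """Width of the singularity spectrum, from tau(q) = q*h(q) - 1."""
    q = np.asarray(q_values, dtype=float)
    tau = q * np.asarray(h, dtype=float) - 1.0
    alpha = np.gradient(tau, q)                  # alpha = d tau / d q
    return float(alpha.max() - alpha.min())
```

As a sanity check, `mfdfa(np.random.default_rng(0).standard_normal(8192), [16, 32, 64, 128, 256, 512], [-3, -2, -1, 0, 1, 2, 3])` should give h(2) near 0.5, the value expected for uncorrelated noise. For speech, `signal` would be the sampled waveform of one utterance, and the scales would need to be chosen relative to its length and sampling rate.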