Abstract:
Machines are now built to emulate human abilities: they can learn and perceive with human-like intelligence. Emotions can be recognized from facial expressions and gestures, and speech emotion recognition (SER) is the extraction of emotions from human speech. Even for humans, recognizing the emotion in an utterance is difficult without considering what it means, so for machines it is a very difficult task. SER can be applied in several fields.
Research on Burmese speech emotion recognition is scarce, and speech emotion datasets for the Burmese language are a low resource. Moreover, an effective fusion of feature extraction methods can achieve better results than any single method. The proposed Burmese speech emotion classification uses the BMISEC dataset, the Burmese Movies Interviews Speech Emotion Corpus, which was prepared as carefully as possible. The deep learning architecture DenseNet has several advantages: it strengthens gradient propagation and improves information flow, which makes it easier to train than comparable models. In the proposed system, DenseNet classifies the fusion of audio features and image features. In the DenseNet-Emotion model used in the system, SelectKBest feature selection selects the best features, and an SVM is used in the classifier layer of the model. A novel feature extraction method for Burmese speech emotion classification, called Text-tone feature extraction, is proposed: emotional Myanmar sentences are segmented into words, the words are converted into speech, and pitch, loudness, and duration features are extracted from these emotion words.
In Burmese speech there are four tones: low, high, creaky, and checked, and pitch, loudness, and duration distinguish these four tones very well. For emotion spectrograms, the system uses the Local Binary Pattern (LBP), a method that handles diverse objects well and performs strongly on speech emotion spectrograms of varying intensity; it extracts emotion information effectively from Burmese emotion spectrograms as well as those of many other languages. These two feature extraction methods are supported by two popular speech feature extraction methods, Mel-frequency Cepstral Coefficients (MFCC) and Discrete Wavelet Transform (DWT), which give excellent results when the emotional speech is free of noise. BMISEC provides a well-built foundation for the proposed system and therefore supports it very well. Fusing the advantages of each feature extraction method gives better results than any single method, so the novel feature extraction is combined with the other three to obtain excellent results. Seven emotion types are classified: happy, angry, disgust, fear, surprise, sad, and neutral. The proposed system achieves an emotion classification accuracy of 88.388% in only 50 epochs.
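As a rough sketch of how the LBP step could turn a spectrogram into an image-feature vector (the 3x3 neighbourhood, thresholding rule, and histogram pooling below are generic LBP conventions, not necessarily the paper's exact configuration):

```python
import numpy as np

def lbp_codes(spec):
    """Basic 8-neighbour Local Binary Pattern over a 2D array
    (e.g. a log-mel spectrogram); border pixels are skipped."""
    h, w = spec.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Clockwise 3x3 neighbour offsets starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = spec[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = spec[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least the centre value.
        codes |= (neighbour >= center).astype(np.uint8) << bit
    return codes

def lbp_histogram(spec):
    """256-bin normalized histogram of LBP codes, usable as an
    image-feature vector for the emotion classifier."""
    hist = np.bincount(lbp_codes(spec).ravel(), minlength=256)
    return hist / hist.sum()

# Usage on a toy "spectrogram".
rng = np.random.default_rng(0)
spec = rng.random((64, 64))
feat = lbp_histogram(spec)   # shape (256,), sums to 1
```

Because the codes depend only on local intensity ordering, the resulting histogram is largely insensitive to the overall intensity of the spectrogram, which is consistent with the robustness to varying intensities noted above.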