Abstract:
In this paper, we present our research on speech classification for the Myanmar language using an image classification approach. We tested the method on Myanmar consonants, vowels, and words, using our recorded database of 22 consonant, 12 vowel, and 54 word sound classes, represented as spectrograms of Myanmar speech. Because the Myanmar language is tonal, many sounds are too similar to classify precisely from audio features alone, whereas their visual representations differ. It is therefore important to consider visual representations of the audio when classifying Myanmar speech. In this study, we applied a convolutional neural network model (Inception-v3) to the spectrogram images, performing transfer learning from weights pre-trained on ImageNet. Validation accuracies of 60.70%, 73.20%, and 94.60% were achieved for the consonant-, vowel-, and word-level classifications, respectively. To assess the performance of the retrained model, both closed and open testing were conducted. Although our experiment differed from traditional audio classification methods, promising results were obtained in this first exploration of Myanmar speech classification using transfer learning for image classification. Notably, these results were attained with Google's Inception-v3 model, which was trained on a different image domain. The research and results therefore demonstrate that Myanmar speech classification by this approach is feasible.
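As a concrete illustration of the pipeline described above, the following Python sketch converts an utterance to a log-mel spectrogram image and then fine-tunes an ImageNet-pretrained Inception-v3 on a directory of such images. It assumes the librosa and TensorFlow/Keras libraries; the directory layout, sample rate, class count, and training settings are illustrative assumptions, not the configuration used in the paper.

# Sketch of the abstract's pipeline: speech -> spectrogram image ->
# transfer learning with Inception-v3. Paths and hyperparameters are
# illustrative assumptions, not the authors' exact settings.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import tensorflow as tf

def save_spectrogram(wav_path, png_path, sr=16000):
    # Load one utterance and render a log-scaled mel spectrogram as an image.
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    fig = plt.figure(figsize=(3, 3))
    librosa.display.specshow(log_mel, sr=sr)
    plt.axis("off")
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

# Inception-v3 with ImageNet weights; the convolutional base is frozen and
# only a new classifier head is trained (transfer learning).
num_classes = 54  # word-level task; 22 for consonants, 12 for vowels
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False

inputs = tf.keras.Input(shape=(299, 299, 3))
x = tf.keras.applications.inception_v3.preprocess_input(inputs)
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Spectrogram images arranged one folder per sound class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "spectrograms/train", image_size=(299, 299), label_mode="categorical")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "spectrograms/val", image_size=(299, 299), label_mode="categorical")
model.fit(train_ds, validation_data=val_ds, epochs=10)

Freezing the convolutional base and retraining only the new classifier head mirrors the transfer-learning setup the abstract describes, in which ImageNet-pretrained features are reused for a different image domain.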
Acknowledgments:
The authors gratefully acknowledge the teachers and students from the University of Computer Studies, Banmaw, who participated in recording the sounds for the Myanmar consonants and vowels, and the other speakers from Kumamoto University, who aided in recording the words.