Abstract:
This dissertation investigates data augmentation and data scrutinizing
methods for developing a speech dataset for text-independent Burmese speaker
identification in the open-set case, in which the test speaker may not have been
pre-modeled and included in the classifier. The acoustic models are built on the
Gaussian Mixture Model-Universal Background Model (GMM-UBM) and the Time Delay
Neural Network (TDNN). A speech dataset for speaker identification is first
constructed because no speech dataset for speaker identification research is
available in Burmese. The data are collected from two domains: web-based
news data and recorded daily conversations. Using this dataset, state-of-the-art
acoustic speaker models for Burmese speaker identification are constructed.
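
The open-set decision described above can be illustrated with a minimal sketch. It assumes that each enrolled speaker has already been assigned a similarity score for the test utterance and that a rejection threshold has been chosen; the function name, the score dictionary, and the threshold value are hypothetical and are not taken from the dissertation.

```python
def identify_open_set(scores, threshold):
    """Return the best-scoring enrolled speaker, or None for an unknown speaker.

    scores:    dict mapping enrolled speaker ID -> similarity score
               (hypothetical input; higher means more similar).
    threshold: assumed rejection threshold; a best score below it is
               treated as coming from a speaker outside the enrolled set.
    """
    if not scores:
        return None
    best_speaker = max(scores, key=scores.get)
    if scores[best_speaker] >= threshold:
        return best_speaker
    return None


# Usage: the first utterance matches an enrolled speaker, the second is rejected.
print(identify_open_set({"spk_a": 2.5, "spk_b": -1.0}, threshold=0.0))
print(identify_open_set({"spk_a": -3.0, "spk_b": -4.2}, threshold=0.0))
```

In a closed-set system the rejection branch would be absent, and the best-scoring speaker would always be returned.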
Speaker identification is the task of analyzing speaker characteristics in
speech to identify individuals accurately. The identification task performs better
when there is enough background training data. Collecting a sufficient amount of
speech data in a short time is very challenging when building a Burmese speaker
identification system, because Burmese can be considered an under-resourced
language in terms of available linguistic resources. To obtain a sufficient
amount of background training data, the MUSAN speech dataset is used for speech
data augmentation. To obtain high-quality training data, several data scrutinizing
techniques are investigated. Among them, two methods are applied to the original
speech dataset: raising the speech intensity to an SNR of 10 dB and slowing the
tempo by a factor of 0.2 without affecting the pitch of the utterances. Moreover, a white
noise-added dataset is also created from the original dataset in order to show that
any kind of noise can degrade identification performance. Mel Frequency
Cepstral Coefficient (MFCC) features are used as front-end processing to extract
the speaker-specific features. In this work, TDNN-based and GMM-UBM-based acoustic
speaker models are constructed from the original, scrutinized, and white noise-added
training data. This comparison indicates the impact of speech data quality on
speaker model construction when scrutinized training data are used, and points out
the important role of speaker models in the identification process. The speakers'
identities are assessed with the probabilistic linear
discriminant analysis (PLDA) approach. System performance is reported in
terms of Equal Error Rate (EER) and detection accuracy (Acc).
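
As an illustration of the EER metric named above, the following is a minimal sketch that sweeps a decision threshold over trial scores and reports the point where the false acceptance rate and the false rejection rate are closest. It assumes two lists of scores (target and non-target trials) are available; the function name and the example scores are hypothetical and do not reproduce the evaluation tooling or results of this work.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    target_scores:    scores of trials where the claimed speaker is correct.
    nontarget_scores: scores of trials where the claimed speaker is wrong.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Usage with made-up scores: overlapping distributions give a non-zero EER.
print(equal_error_rate([1.0, 3.0], [0.0, 2.0]))
```

With a finite set of trials the two error rates may never be exactly equal, so this sketch averages them at the closest threshold; interpolating between neighbouring thresholds is a common alternative.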