Abstract:
This thesis focuses on enhancing a Myanmar Text-to-Speech (TTS) system to
generate more natural synthetic speech for a given input text. A typical TTS system
has two main components: text analysis (front-end) and speech waveform
generation (back-end). Both the front-end and the back-end are important for the
intelligibility and naturalness of a TTS system. Therefore, this thesis emphasizes
both the text analysis part and the acoustic modelling part of a Statistical Parametric
Speech Synthesis (SPSS) system.
The text analysis part consists of a number of natural language processing (NLP)
steps, and text normalization is the first and most crucial phase among them. Myanmar
text contains many non-standard words (NSWs) involving numbers. Therefore, Myanmar
number normalization designed for the Myanmar TTS system is implemented using
Weighted Finite-State Transducers (WFSTs). For grapheme-to-phoneme (G2P)
conversion in the text analysis part, the first large Myanmar pronunciation dictionary is
built, and its quality is confirmed by applying machine learning techniques such as
sequence-to-sequence modelling. To extract contextual linguistic features that can
improve the quality of the speech synthesized by the Myanmar TTS system, phoneme
features and a large Myanmar pronunciation dictionary with syllable information are
prepared on a general speech synthesis architecture, Festival. After that, a proposed
Myanmar question set is applied to extract the linguistic features that will be used in
neural-network-based speech synthesis. Finally, the word segmentation, WFST-based
number normalization, G2P conversion, and contextual label extraction modules are
integrated into the text analysis part of the Myanmar TTS system.
The accuracy of the acoustic model in SPSS is very important for achieving
good-quality synthetic speech. In this work, Hidden Markov Model (HMM) based
Myanmar speech synthesis is conducted with the many contextual labels extracted from
the text analysis part and is used as the baseline system. State-of-the-art modelling
techniques such as the Deep Neural Network (DNN) and the Long Short-Term Memory
Recurrent Neural Network (LSTM-RNN) are applied in the acoustic modelling of
Myanmar speech synthesis to improve the naturalness of the synthesized speech. The
effectiveness of contextual linguistic features and tone information is explored in
LSTM-RNN based Myanmar speech synthesis using the proposed Myanmar question
set. Furthermore, the effect of applying word embedding and/or Part-of-Speech (POS)
features as additional input features in the acoustic modelling of the DNN and
LSTM-RNN based systems is investigated. The effect of word vector features can be
seen clearly in the DNN based system in both objective and subjective evaluations.
However, in the LSTM-RNN based systems, word embedding features give only a
slight improvement in the subjective results and no improvement in the objective
results. Therefore, it can be concluded that the contextual linguistic features extracted
by our text analysis part with the proposed question set are sufficient for the acoustic
modelling of the LSTM-RNN based Myanmar TTS system to generate more natural
synthesized speech for the Myanmar language. According to the objective and
subjective results, the hybrid system of DNN and LSTM-RNN (i.e., four feedforward
hidden layers followed by two LSTM-RNN layers) is the most suitable network
architecture for Myanmar speech synthesis.