Abstract:
This thesis focuses on enhancing a Myanmar Text-to-Speech (TTS) system to
generate more natural synthetic speech for a given input text. A typical TTS system
has two main components: text analysis (front-end) and speech waveform
generation (back-end). Both the front-end and the back-end are important for the
intelligibility and naturalness of a TTS system. Therefore, this thesis emphasizes
both the text analysis part and the acoustic modelling part of a Statistical Parametric
Speech Synthesis (SPSS) system.
The text analysis part consists of a number of natural language processing (NLP)
steps, and text normalization is the first and most crucial phase among them. Myanmar
text contains many non-standard words (NSWs) involving numbers. Therefore, Myanmar
number normalization designed for the Myanmar TTS system is implemented using
Weighted Finite-State Transducers (WFSTs). For grapheme-to-phoneme (G2P)
conversion in the text analysis part, the first large Myanmar pronunciation dictionary is
built, and its quality is confirmed by applying machine learning techniques such as
sequence-to-sequence modelling. To extract contextual linguistic features that can
improve the quality of the speech synthesized by the Myanmar TTS system, phoneme
features and a large Myanmar pronunciation dictionary with syllable information are
prepared on a general speech synthesis architecture, Festival. After that, a proposed
Myanmar question set is applied to extract the linguistic features that will be used in
neural-network-based speech synthesis. Finally, the word segmentation, WFST-based
number normalization, G2P conversion, and contextual label extraction modules are
integrated into the text analysis part of the Myanmar TTS system.
The accuracy of the acoustic model in SPSS is very important for achieving
good-quality synthetic speech. In this work, Hidden Markov Model (HMM) based
Myanmar speech synthesis is conducted with the many contextual labels extracted from
the text analysis part and is used as the baseline system. State-of-the-art modelling
techniques such as the Deep Neural Network (DNN) and the Long Short-Term Memory
Recurrent Neural Network (LSTM-RNN) are applied in the acoustic modelling of
Myanmar speech synthesis to improve the naturalness of the synthesized speech. The
effectiveness of contextual linguistic features and tone information is explored in
LSTM-RNN based Myanmar speech synthesis using the proposed Myanmar question
set. Furthermore, the effect of applying word embedding and/or Part-of-Speech (POS)
features as additional input features in the acoustic modelling of the DNN and
LSTM-RNN based systems is investigated. The effect of word vector features can be
seen clearly in the DNN based system in both objective and subjective evaluations.
However, in the LSTM-RNN based systems, word embedding features give only a
slight improvement in the subjective results and no improvement in the objective
results. Therefore, it can be concluded that the contextual linguistic features extracted
by our text analysis part with the proposed question set are sufficient for the acoustic
modelling of the LSTM-RNN based Myanmar TTS system to generate more natural
synthesized speech for the Myanmar language. According to the objective and
subjective results, the hybrid system of DNN and LSTM-RNN (i.e., four feedforward
hidden layers followed by two LSTM-RNN layers) is the most suitable network
architecture for Myanmar speech synthesis.