Joint Word Segmentation and Stemming for Myanmar Language

Oo, Yadanar

URI: http://onlineresource.ucsy.edu.mm/handle/123456789/2384

Date: 2019-10

Abstract:

With the rapid growth of internet use, the amount of unstructured Myanmar text data has increased enormously, and it has become necessary to retrieve exactly the data that matches a user query. The effectiveness of search is closely tied to the stemming process: identifying the stem word in a given text is an important part of any natural language processing (NLP) task. Myanmar texts typically contain many different forms of a basic word, and morphological variants are generally the most common problem behind misspellings, wrong translations and irrelevant retrieval results. Since written Myanmar does not use blank spaces to indicate word boundaries, segmenting Myanmar text is an essential task for Myanmar language processing. Besides word segmentation, it is necessary to identify the stem words in the sentence. Here, stemming refers to marking each word in the segmentation result with its correct word type, for example root word, single word, prefix or suffix. Together, segmentation and stemming are referred to as morphological analysis. Two main problems arise during word segmentation: segmentation ambiguities and unknown words. There are basically two types of segmentation ambiguity, covering ambiguity and overlapping ambiguity, and both involve known words. An unknown word is a word that is not found in the system dictionary; in other words, it is an out-of-vocabulary word. For any language, even the largest dictionary cannot register all geographical names, person names, organization names, technical terms, duplicated words and so on. Named entity recognition (NER) refers to recognizing entities that have specific meanings in the text, including persons, locations and organizations. Normally, stemming is treated as a separate process from segmentation. 
In this new approach, segmentation, stemming and named entity detection are integrated into a single lexical analysis system, an integration that benefits all three processes. Although many stemmers are available for the major languages, there is no stemmer for the Myanmar language. The main aim is to produce a Myanmar stemmer that also solves the word segmentation problem and detects named entities. This is the first work on joint Myanmar word segmentation, stemming and named entity detection. Deep learning approaches have become increasingly popular in NLP tasks, and this system proposes a BiLSTM-CNN-CRF network architecture that jointly learns the three processes. In this approach, stemming and named entity detection are treated as a typical sequence tagging problem over segmented words, while segmentation is modelled as a syllable-level tagging problem that identifies word boundaries by predicting labels. The result is a joint neural sequence labelling approach that predicts combined labels of segmentation boundaries and stemming and named entity tags at the syllable level. The BiLSTM-CNN-CRF architecture learns both character-level and syllable-level features, and this research presents the first evaluation of such an architecture on Myanmar language evaluation datasets. It also evaluates different network architectures and hyperparameter settings, including pre-trained embeddings, dropout rate, learning rate and choice of optimizer.
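
As an illustration of the combined labels described above, the following minimal Python sketch shows one way a boundary marker and a word-level tag could be fused into a single label per syllable, and how words can be recovered from those labels. The B/I boundary scheme, the tag names (ROOT, SUFFIX, LOC), the placeholder syllables and the function names are all assumptions made for this example; the abstract does not specify the actual tag set or encoding used in the thesis.

```python
from typing import List, Tuple

# One word = (its syllables, a word-level tag). Tag names are hypothetical.
Word = Tuple[List[str], str]


def encode_joint_labels(words: List[Word]) -> List[Tuple[str, str]]:
    """Flatten tagged words into (syllable, joint label) pairs.

    The joint label combines a boundary marker (B = first syllable of a
    word, I = any later syllable) with the word-level tag, e.g. "B-ROOT".
    """
    pairs = []
    for syllables, tag in words:
        for i, syllable in enumerate(syllables):
            boundary = "B" if i == 0 else "I"
            pairs.append((syllable, f"{boundary}-{tag}"))
    return pairs


def decode_joint_labels(pairs: List[Tuple[str, str]]) -> List[Word]:
    """Recover words and their tags from syllable-level joint labels."""
    words: List[Word] = []
    for syllable, label in pairs:
        boundary, tag = label.split("-", 1)
        if boundary == "B" or not words:
            # A "B" label (or the very first syllable) opens a new word.
            words.append(([syllable], tag))
        else:
            # An "I" label extends the word currently being built.
            words[-1][0].append(syllable)
    return words


# Placeholder syllables stand in for Myanmar syllables.
sentence: List[Word] = [
    (["syl1", "syl2"], "ROOT"),
    (["syl3"], "SUFFIX"),
    (["syl4", "syl5"], "LOC"),
]
labelled = encode_joint_labels(sentence)
restored = decode_joint_labels(labelled)
```

Under this encoding a sequence tagger only has to predict one label per syllable, and a single decoding pass yields word boundaries, word types and named entity tags together.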
