Abstract:
With the rapid growth of internet use, the amount of unstructured
Myanmar text data has increased dramatically. It is therefore necessary to retrieve
exactly the data that matches a user query. The effectiveness of searching depends
heavily on the stemming process. Identifying the stem word in a given text is an
important aspect of any Natural Language Processing (NLP) task. In the Myanmar
language, texts typically contain many different forms of a basic word. Morphological
variants are among the most common causes of misspellings, wrong translations and
irrelevant retrieval results.
Since written Myanmar does not use blank spaces to indicate word
boundaries, segmenting Myanmar texts is an essential task for Myanmar
language processing. Besides word segmentation, it is necessary to identify the stem
words in the sentence. Stemming refers to the process of marking each word in the word
segmentation result with the correct word type, for example, root word, single word,
prefix, suffix, etc. Together, the segmentation and stemming processes are referred to
as morphological analysis. During the process of word segmentation, two main problems
occur: segmentation ambiguities and unknown word occurrences. There are basically
two types of segmentation ambiguities: covering ambiguity and overlapping ambiguity.
These ambiguities arise among known words. An unknown word is defined as a word
that is not found in the system dictionary. In other words, it is an out-of-vocabulary
word. For any language, even the largest dictionary cannot register
all geographical names, person names, organization names, technical terms and some
duplication words, etc. Named entity recognition (NER) refers to recognizing entities
that have specific meanings in the identified text, including persons, locations,
organizations, etc.
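The two ambiguity types and the unknown-word case can be illustrated with a minimal sketch. It assumes a toy lexicon of placeholder syllables (`a`, `b`, `c`, `x`, `y`, `q` are hypothetical stand-ins, not Myanmar data) and a hypothetical helper `segmentations` that enumerates every dictionary-consistent split; it is not the method used in this research.

```python
from typing import List, Set, Tuple

Word = Tuple[str, ...]  # a word is a tuple of syllables


def segmentations(syllables: List[str], lexicon: Set[Word]) -> List[List[Word]]:
    """Return every way to split `syllables` into words found in `lexicon`."""
    if not syllables:
        return [[]]  # one way to segment nothing: the empty segmentation
    results: List[List[Word]] = []
    for end in range(1, len(syllables) + 1):
        head = tuple(syllables[:end])
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(syllables[end:], lexicon):
                results.append([head] + rest)
    return results


# Overlapping ambiguity: in "a b c", both "a b" and "b c" are words,
# so the middle syllable can attach to either side.
overlap_lexicon = {("a",), ("c",), ("a", "b"), ("b", "c")}
overlapping = segmentations(["a", "b", "c"], overlap_lexicon)

# Covering ambiguity: "x y" is a word, and so are "x" and "y" on their own.
cover_lexicon = {("x",), ("y",), ("x", "y")}
covering = segmentations(["x", "y"], cover_lexicon)

# Unknown (out-of-vocabulary) word: no dictionary segmentation exists.
unknown = segmentations(["q"], overlap_lexicon)
```

More than one result signals an ambiguity among known words, while an empty result signals that the input contains an out-of-vocabulary word.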
Normally, stemming is treated as a process separate from segmentation. In
this new approach, segmentation, stemming and named entity detection are integrated
into a single lexical analysis system. The contribution of this research is to integrate
segmentation, stemming and named entity detection in a way that benefits all three
processes. Although many stemmers are available for the major languages, there is no
stemmer for the Myanmar language. The main aim is to produce a Myanmar stemmer that
also solves the word segmentation problem and detects named entities. This is the first
work on joint Myanmar word segmentation, stemming and named entity detection.
Nowadays, deep learning approaches have become increasingly popular in
NLP tasks. This system proposes a BiLSTM-CNN-CRF network architecture that jointly
learns the three processes. In this approach, stemming and named entity detection are
treated as a typical sequence tagging problem over segmented words, while
segmentation can also be modelled as a syllable-level tagging problem that identifies
word boundaries by predicting labels. The approach is an effective joint neural
sequence labelling model that predicts, at the syllable level, combined labels encoding
segmentation boundaries together with stemming and named entity detection tags.
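The joint syllable-level labelling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the boundary scheme (`B`, `I`, `E`, `S`), the tag names (`ROOT`, `SUFFIX`, `PER`) and the placeholder syllables `s1` to `s6` are hypothetical and are not taken from this research.

```python
from typing import List, Tuple

TaggedWord = Tuple[List[str], str]  # (syllables of one word, word-level tag)


def encode(words: List[TaggedWord]) -> Tuple[List[str], List[str]]:
    """Flatten tagged words into syllables with joint boundary+tag labels."""
    syllables: List[str] = []
    labels: List[str] = []
    for sylls, tag in words:
        n = len(sylls)
        for i, syl in enumerate(sylls):
            if n == 1:
                pos = "S"  # single-syllable word
            elif i == 0:
                pos = "B"  # first syllable of a word
            elif i == n - 1:
                pos = "E"  # last syllable of a word
            else:
                pos = "I"  # inside a word
            syllables.append(syl)
            labels.append(f"{pos}-{tag}")
    return syllables, labels


def decode(syllables: List[str], labels: List[str]) -> List[TaggedWord]:
    """Recover words and their tags from joint syllable-level labels."""
    words: List[TaggedWord] = []
    current: List[str] = []
    tag = ""
    for syl, label in zip(syllables, labels):
        pos, tag = label.split("-", 1)
        current.append(syl)
        if pos in ("E", "S"):  # a word boundary closes the current word
            words.append((current, tag))
            current = []
    if current:  # tolerate a label sequence that ends mid-word
        words.append((current, tag))
    return words


# Hypothetical example: a two-syllable root, a suffix, and a person name.
example = [(["s1", "s2"], "ROOT"), (["s3"], "SUFFIX"), (["s4", "s5", "s6"], "PER")]
example_syllables, example_labels = encode(example)
```

Because each label carries both the boundary position and the word-level tag, a single tagger over syllables yields the segmentation and the stem or entity types in one pass, and `decode` recovers both from its output.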
This research presents a BiLSTM-CNN-CRF architecture that learns both
character- and syllable-level features, and provides the first evaluation of such an
architecture on Myanmar language evaluation datasets. This research also evaluates
different network architectures and optimizes several hyperparameters, such as
pre-trained embeddings, dropout rate, learning rate and the choice of optimizer.