Abstract:
More and more information is created online every day, and much of it is in
natural language. Until recently, businesses have been unable to analyze this
data, but advances in Natural Language Processing (NLP) make it possible to analyze
and learn from a greater range of data sources. NLP also has important
implications for the way computers and humans interact in daily life. By
providing a bridge between human and machine and enabling access to stored
information, NLP plays a vital role in a multilingual society. Technologies built
on NLP are becoming increasingly widespread.
Named Entity Recognition (NER), the task of recognizing names in text and
assigning those recognized Named Entities (NEs) to particular NE types such as
person name, location or organization, is a key component in many sophisticated
systems, especially in information retrieval (IR) systems. NER for the Myanmar
language is essential for the development of Myanmar NLP, and it is not an easy
task for many reasons.
This dissertation aims to develop Named Entity Recognition (NER) for the
Myanmar language and to promote Myanmar NLP research. Myanmar NLP is still at an
early stage, and its development has been difficult. Likewise, there are no
publicly available resources, either free or commercial, for language computation,
so Myanmar is regarded as a low-resourced language. For this reason, a named
entity (NE) tagged corpus for Myanmar NER research is manually annotated and
constructed as part of this dissertation. The annotated NE corpus is essential for
the development of Myanmar NER research. This NE tagged corpus is used in all the
experiments conducted for Myanmar NER, and it will also be made available for
future NER research.
In written Myanmar, there are no regular spaces between words or phrases, and
syllables are the basic units of the language. Thus, all the experiments in this
work are conducted on syllable-level data rather than on characters or words.
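
To illustrate what syllable-level input looks like, the following is a minimal sketch of rule-based syllable segmentation. It is an assumption made for illustration, not the segmenter used in this dissertation: the function name `split_syllables` and the single rule it applies are hypothetical. The rule treats a Myanmar consonant (U+1000 to U+1021) as the start of a new syllable unless it is stacked under a virama (U+1039) or is itself a final consonant followed by asat (U+103A) or virama. Digits, punctuation, and independent vowels are not handled.

```python
import re

# Assumption: a simplified rule for illustration only, not the dissertation's method.
# A syllable starts at a Myanmar consonant (U+1000-U+1021) that is
#   - not preceded by the virama U+1039 (i.e. not a stacked consonant), and
#   - not followed by asat U+103A or virama U+1039 (i.e. not a syllable-final consonant).
SYLLABLE_START = re.compile(r"(?<!\u1039)[\u1000-\u1021](?![\u103A\u1039])")


def split_syllables(text):
    """Split unspaced Myanmar text into a list of syllables (hypothetical helper)."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    # Make sure the first chunk begins at position 0 even if it has no consonant start.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


if __name__ == "__main__":
    # The word "Myanmar" written with Unicode escapes: two syllables.
    word = "\u1019\u103c\u1014\u103a\u1019\u102c"
    print(split_syllables(word))
```

Each resulting syllable would then receive one NE label, which is what makes the syllable a convenient tagging unit when word boundaries are not marked.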
In this study, NER for the Myanmar language is built by applying a deep neural
network architecture, specifically a Long Short-Term Memory (LSTM) based network.
The performance of the neural model is also compared with a baseline statistical
Conditional Random Field (CRF) model. This statistical model depends entirely on
feature engineering. As Myanmar is a low-resourced language, name dictionaries
and gazetteers are not available. If such external feature resources are available
and feature engineering is carefully done, based on knowledge, to cover all
situations, statistical methods can provide superior results. This work shows
that, even without additional features, deep neural networks work well on
Myanmar NER and outperform the baseline statistical CRF model. The best accuracy is
achieved with a bidirectional LSTM based network architecture. Therefore, this work
eliminates the feature-engineering process and does not require language or
domain knowledge.
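
To make concrete what feature engineering for a statistical CRF tagger involves, the sketch below builds a hand-crafted feature dictionary for one syllable. It is illustrative only: the function `syllable_features`, the feature names, the two-syllable context window, and the optional `gazetteer` argument are all assumptions, not the feature set of the baseline in this dissertation. The gazetteer defaults to empty, reflecting the statement above that no such resource is available for Myanmar.

```python
def syllable_features(syllables, i, gazetteer=frozenset()):
    """Return hand-crafted features for the syllable at position i.

    Hypothetical feature template for illustration; a real CRF baseline
    would need features designed and tuned for the language.
    """
    feats = {
        "bias": 1.0,
        "syl": syllables[i],                       # the syllable itself
        "is_first": i == 0,                        # sentence-initial position
        "is_last": i == len(syllables) - 1,        # sentence-final position
        "in_gazetteer": syllables[i] in gazetteer, # external resource, empty by default
    }
    # Context window: up to two syllables on either side, when they exist.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(syllables):
            feats[f"syl[{offset:+d}]"] = syllables[j]
    return feats


if __name__ == "__main__":
    # Placeholder syllables; any pre-segmented sentence could be used here.
    sentence = ["s1", "s2", "s3", "s4"]
    print(syllable_features(sentence, 0))
```

Every such template has to be designed by hand, which is the effort the neural approach avoids by learning representations directly from the syllable sequence.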
The proposed syllable-based neural architecture for the Myanmar NER model has
three main layers: a character sequence layer, a syllable sequence layer, and an
inference layer. For each input syllable sequence, syllables are represented by
their syllable embeddings. The character sequence layer is used to automatically
extract syllable-level features by encoding the character sequence within each
syllable. A Convolutional Neural Network (CNN) is applied at the character
sequence representation layer to learn character sequence features within each
input syllable. The syllable sequence layer takes the syllable representations as
input and extracts sentence-level features, which are fed into the inference
layer. For the syllable sequence representation, a bidirectional LSTM is utilized
to learn sentence-level features, and a CRF inference layer is then added jointly
on top of the network to tag the name labels. The proposed CNN_BiLSTM_CRF neural
model gives the best performance among the experiments conducted for Myanmar NER.
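
The role of the CRF inference layer can be illustrated with a small decoding sketch. Given per-syllable label scores (which in the architecture above would come from the bidirectional LSTM) and label-to-label transition scores, Viterbi decoding returns the highest-scoring label sequence. The function `viterbi_decode`, the label names, and all scores below are assumptions chosen for illustration; the tagging scheme and the learned parameters of the actual model are not given here.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the best label sequence under a linear-chain scoring model.

    emissions[t][j]   : score of label j for syllable t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(labels)
    if not emissions:
        return []
    # best[j] = score of the best path that ends in label j at the current position
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        step_best, step_back = [], []
        for j in range(n_labels):
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            step_back.append(prev)
            step_best.append(best[prev] + transitions[prev][j] + scores[j])
        best = step_best
        backpointers.append(step_back)
    # Follow the back-pointers from the best final label to recover the path.
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for step_back in reversed(backpointers):
        last = step_back[last]
        path.append(last)
    path.reverse()
    return [labels[i] for i in path]


if __name__ == "__main__":
    labels = ["O", "B-PER", "I-PER"]       # hypothetical label set
    transitions = [[0.0] * 3 for _ in labels]
    transitions[0][2] = -1e4               # discourage O -> I-PER
    emissions = [[1.0, 0.8, 0.0],          # made-up scores for two syllables
                 [0.0, 0.0, 2.0]]
    print(viterbi_decode(emissions, transitions, labels))
```

In this toy input the first syllable scores slightly higher as `O` on its own, but the transition penalty makes the decoder prefer a consistent name span. This is the kind of label dependency a CRF layer can capture that independent per-syllable classification cannot.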