Abstract:
This article presents a comprehensive study on two primary tasks in Burmese (Myanmar) morphological
analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese newswire sentences
are annotated with two-layer tokenization and POS-tagging information, as one component of the Asian
Language Treebank Project. The annotated corpus has been released under a CC BY-NC-SA license, and it was
the largest open-access database of annotated Burmese at the time this manuscript was prepared in 2017. Detailed
descriptions of the preparation, refinement, and features of the annotated corpus are provided in the first half
of the article. Facilitated by the annotated corpus, experiment-based investigations are presented in the second
half of the article, wherein the standard sequence-labeling approach of conditional random fields and a long
short-term memory (LSTM)-based recurrent neural network (RNN) are applied and discussed. We draw
several general conclusions, covering the effect of joint tokenization and POS-tagging and the importance of
ensembling for stabilizing the performance of the LSTM-based RNN. This study provides a solid
basis for further studies on Burmese processing.