Abstract:
In Natural Language Processing (NLP), word segmentation and part-of-speech (POS) tagging
are fundamental tasks. POS information is also needed in the preprocessing stage of NLP applications such
as machine translation (MT) and information retrieval (IR). Currently, there are many research efforts in
word segmentation and POS tagging that address the two tasks separately, using different methods to achieve
high performance and accuracy. However, while numerous models exist for other languages, little work has been
performed for the Myanmar language. This paper describes the construction of a Myanmar corpus for joint
word segmentation and part-of-speech tagging of the Myanmar language. The corpus contains
51,207 sentences and 839,161 words and is annotated with 12 tags. To evaluate the
corpus, a Hidden Markov Model (HMM) is trained on different data sizes and tested with both closed and open tests.
An accuracy of 94% in the experiments indicates that the corpus is suitable for this task.