Abstract:
The Myanmar language is spoken by more than 33 million people, who use it for
verbal and written communication, and it is the official language of the Republic of the
Union of Myanmar. With the rapid growth of digital content in the Myanmar language,
applications such as machine learning, translation and information retrieval have become
popular, and effective Natural Language Processing (NLP) studies are required to support
them. NLP for the Myanmar language still faces major challenges. Segmenting,
stemming and Part-of-Speech (POS) tagging are pre-processing steps in text mining
applications and a very common requirement of Natural Language Processing
functions. In fact, they are very important in most Information Retrieval systems. The
main objective of this thesis is to study the morphology of Myanmar words, to implement
n-gram based word segmentation, and to propose grammatical stemming rules and POS
tagging rules for the Myanmar language. This thesis proposes word segmentation,
stemming and POS tagging based on an n-gram method and a rule-based stemming method
that can cope with the challenges of Myanmar NLP tasks. The system generates not only
the segmented words but also the stemmed words with their POS tags by
removing prefixes, infixes and suffixes. It achieves 82% accuracy. The data are
collected from several online sources, and the system is implemented in
Python.
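
The pipeline summarized above (n-gram based segmentation followed by rule-based
stemming with POS tags) can be illustrated with a minimal Python sketch. It is not
the thesis implementation: the greedy longest-match strategy, the function names
segment and stem_and_tag, the romanized placeholder syllables, the toy lexicon and
the suffix-to-tag table are all assumptions made for illustration, and only suffix
removal is shown.

```python
# Toy lexicon and suffix rules. Both are hypothetical, romanized placeholders,
# not data or rules taken from the thesis.
LEXICON = {"kyaungthamya", "kyaungtha", "thwakhe", "thwa"}
SUFFIX_TAGS = {"mya": "N", "khe": "V"}  # assumed mapping: suffix -> POS tag


def segment(syllables, lexicon, max_n=3):
    """Group syllables into words by greedy longest match over syllable n-grams."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest n-gram first, then shorter ones.
        for n in range(min(max_n, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


def stem_and_tag(word, suffix_tags, default_tag="UNK"):
    """Strip the longest matching suffix and return (stem, POS tag)."""
    for suffix, tag in sorted(suffix_tags.items(), key=lambda kv: -len(kv[0])):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], tag
    return word, default_tag


if __name__ == "__main__":
    syllables = ["kyaung", "tha", "mya", "thwa", "khe"]
    for word in segment(syllables, LEXICON):
        print(word, stem_and_tag(word, SUFFIX_TAGS))
```

The sketch assumes the input has already been split into syllables. A fuller
system would also need syllable segmentation of raw Myanmar text, along with
prefix and infix rules.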