Towards Burmese (Myanmar) Morphological Analysis: Syllable-based Tokenization and Part-of-speech Tagging

Ding, Chen Chen; Aye, Hnin Thu Zar; Pa, Win Pa; Nwet, Khin Thandar; Soe, Khin Mar; Utiyama, Masao; Sumita, Eiichiro

dc.contributor.author	Ding, Chen Chen
dc.contributor.author	Aye, Hnin Thu Zar
dc.contributor.author	Pa, Win Pa
dc.contributor.author	Nwet, Khin Thandar
dc.contributor.author	Soe, Khin Mar
dc.contributor.author	Utiyama, Masao
dc.contributor.author	Sumita, Eiichiro
dc.date.accessioned	2020-12-30T05:47:29Z
dc.date.available	2020-12-30T05:47:29Z
dc.date.issued	2019-06
dc.identifier.issn	2375-4699
dc.identifier.uri	https://onlineresource.ucsy.edu.mm/handle/123456789/2544
dc.description.abstract	This article presents a comprehensive study on two primary tasks in Burmese (Myanmar) morphological analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese newswire sentences are annotated with two-layer tokenization and POS-tagging information, as one component of the Asian Language Treebank Project. The annotated corpus has been released under a CC BY-NC-SA license, and it was the largest open-access database of annotated Burmese when this manuscript was prepared in 2017. Detailed descriptions of the preparation, refinement, and features of the annotated corpus are provided in the first half of the article. Facilitated by the annotated corpus, experiment-based investigations are presented in the second half of the article, wherein the standard sequence-labeling approach of conditional random fields (CRF) and a long short-term memory (LSTM)-based recurrent neural network (RNN) are applied and discussed. We obtained several general conclusions, covering the effect of joint tokenization and POS-tagging and the importance of ensembling for stabilizing the performance of the LSTM-based RNN. This study provides a solid basis for further studies on Burmese processing.	en_US
dc.language.iso	en	en_US
dc.publisher	ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) Journal	en_US
dc.relation.ispartofseries	Volume 19, Issue 1
dc.subject	Burmese (Myanmar)	en_US
dc.subject	annotated corpus	en_US
dc.subject	tokenization	en_US
dc.subject	POS-tagging	en_US
dc.subject	morphological analysis	en_US
dc.subject	CRF	en_US
dc.subject	LSTM-based RNN	en_US
dc.title	Towards Burmese (Myanmar) Morphological Analysis: Syllable-based Tokenization and Part-of-speech Tagging	en_US
dc.type	Article	en_US