Abstract:
Basic word identification is an essential
preprocessing step for Part-of-Speech tagging.
Before disambiguating
among the multiple Part-of-Speech tags of a
basic or root word, word boundaries must be
identified, because basic words are not
consistently separated by delimiters and there is
no standard word break in Myanmar sentences.
This paper therefore proposes a word
identification or segmentation approach for
Myanmar sentences. A
Myanmar lexicon is used to identify each basic
word by applying the longest word length matching
method, and rules are generated to identify
reduplicated words. The proposed approach
achieves high performance when evaluated on
two test corpora, and it is a useful tool
for many Myanmar Natural Language
Processing applications.
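
The two steps named above, lexicon-based longest matching followed by
a reduplication rule, can be illustrated with a minimal sketch. It is
not the implementation evaluated in this paper: the function names, the
toy Latin-letter lexicon, and the single reduplication rule (joining an
immediately repeated token) are assumptions made only for illustration.
The sketch also matches over individual characters, whereas a practical
Myanmar segmenter would more likely match over syllables.

```python
def segment_longest_match(text, lexicon):
    """Greedy left-to-right longest matching against a lexicon (sketch)."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                break
        else:
            # No lexicon entry matched: fall back to a single character.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


def merge_reduplicated(tokens):
    """Hypothetical rule: join an immediately repeated token into one word."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == tokens[i + 1]:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy lexicon of placeholder strings (an assumption, not real Myanmar data).
TOY_LEXICON = {"ab", "abc", "d", "de"}
tokens = segment_longest_match("abcdede", TOY_LEXICON)
words = merge_reduplicated(tokens)
```

Here `tokens` holds the output of the matching step and `words` the
output after the reduplication rule. Trying the longest candidate first
is what lets "abc" win over its prefix "ab" in the toy lexicon.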