Abstract:
Myanmar sentences are written as contiguous
sequences of syllables with no characters delimiting the
words. In statistical machine translation (SMT), word
segmentation is a necessary step for languages that do
not naturally delimit words. Myanmar is a low-resource
language and therefore it is difficult to develop a good
word segmentation tool based on machine learning
techniques. In this paper, we examine various word
segmentation schemes and their effect on the translation
from Myanmar to seven other languages. We performed
experiments based on character segmentation, syllable
segmentation, human lexical/phrasal segmentation, and
unsupervised/supervised word segmentation. The results
show that the highest-quality machine translation was
attained with syllable segmentation, and we found this
effect to be greatest for translation into subject-object-verb
(SOV) structured languages such as Japanese and
Korean. Approaches based on machine learning were
unable to match this performance for most language
pairs, and we believe this was due to the lack of
linguistic resources. However, a machine learning
approach that extended syllable segmentation produced
promising results and we expect this can be developed
into a viable method as more data becomes available in
the future.