Abstract:
Corpora must be developed to build statistical
language models; however, corpus development is
labor-intensive and very expensive. Corpus resources
are often scarce in domains such as a new language, a
new task, or a new application. This situation is called
"under-resourced." Myanmar is one of the under-resourced
languages for Japanese computational linguists, and
vice versa. For both groups, their mother tongues are
"well-resourced." Fortunately, since Japanese and
Myanmar share the same Subject-Object-Verb (SOV)
word order, higher translation quality can be expected
between these two languages. Thus, Japanese and
Myanmar can serve as mediator (or hub) languages for
each other toward other languages. This paper proposes
the reutilization of well-resourced domain data to
compensate for under-resourced domain data. As a
feasible example, this paper introduces the essence of my
past research [1, 2] on model adaptation with
machine-translated text.
Description:
I would like to express my sincere thanks to Dr. Ye
Kyaw Thu for useful discussions on the basics of the
Myanmar language, to Dr. Satoshi Takahashi at NTT
Media Intelligence Lab. and Professor Yoshinori
Sagisaka at Waseda University for their warm
encouragement, and to NTT Media Intelligence Lab. for
their support.