Abstract:
The Internet is made up of web pages, news stories, status updates, blogs, and many 
other kinds of content, and the amount of information available online is ever-growing. 
Automatic text summarization is therefore greatly needed to find relevant information 
and consume it faster. The basic goal of text summarization is to extract the most 
accurate and helpful information from a large document while removing unnecessary 
or unimportant content.
Research on automatic text summarization for the Myanmar language is still limited, 
and there is no freely available Myanmar summarization corpus. Previous Myanmar 
text summarization work used a template-driven summarization approach, 
summarization based on a verb frame resource, and query-based text summarization 
approaches.
The primary aim of this dissertation is to develop an automatic Myanmar news 
summarization system based on centroid-based word embedding models, and to 
compare it against a Naïve Bayes method. Furthermore, because no summarization 
corpus is publicly available for Myanmar NLP research, this dissertation constructs a 
Myanmar news summarization corpus consisting of news and summary pairs.
Myanmar is a low-resource language, with few monolingual or multilingual corpora 
and few manually annotated linguistic resources to support natural language processing 
(NLP) applications. The first step of this work on Myanmar news summarization is 
therefore the creation of a corpus of Myanmar news document and summary pairs. One 
thousand Myanmar news articles are collected from Myanmar news websites for this 
corpus. A summary for each article is written manually at compression rates of 30% 
and 40%. Producing such manual (gold) summaries is time-consuming and requires 
considerable human effort.
In this dissertation, extractive Myanmar news summarization is implemented 
using an unsupervised centroid-based method and a supervised Naïve Bayes method. 
An extractive text summarization method typically consists of a representation phase, 
a scoring phase, and a ranking phase.
Most extractive summarization systems use the bag-of-words model for 
representation, although it has limitations: it cannot capture the semantic meaning of 
words. Even when two sentences are strongly related, a bag-of-words representation 
cannot capture their relationship if they share no common word. To address this issue, 
word embedding representations are used for sentence scoring and ranking in this 
dissertation. The centroid-based method identifies the most central sentences in a 
collection, i.e., those that contain the necessary and sufficient information related to 
the main theme of the document.
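The centroid-based scoring idea described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the word vectors are tiny hypothetical stand-ins for a pretrained model such as FastText, and sentences are represented as the average of their word embeddings.

```python
import numpy as np

# Hypothetical 3-d word vectors standing in for a pretrained embedding
# model such as FastText (a real model would provide ~300-d vectors).
word_vectors = {
    "economy": np.array([0.9, 0.1, 0.0]),
    "growth":  np.array([0.8, 0.2, 0.1]),
    "market":  np.array([0.7, 0.3, 0.0]),
    "weather": np.array([0.0, 0.1, 0.9]),
    "rain":    np.array([0.1, 0.0, 0.8]),
}

def embed_sentence(tokens):
    """Represent a sentence as the average of its word embeddings."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 for zero vectors)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# A toy "document" of three tokenized sentences.
sentences = [
    ["economy", "growth", "market"],
    ["market", "growth"],
    ["weather", "rain"],
]

# Representation phase: embed each sentence, then take the document
# centroid as the mean of all sentence embeddings.
embeddings = [embed_sentence(s) for s in sentences]
centroid = np.mean(embeddings, axis=0)

# Scoring and ranking phases: score sentences by similarity to the
# centroid, rank them, and keep the top-k in original document order.
scores = [cosine(e, centroid) for e in embeddings]
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
summary_ids = sorted(ranked[:2])
```

Here the two "economy" sentences sit close to the centroid and are selected, while the off-topic "weather" sentence is ranked last, which is exactly the behaviour the centroid criterion aims for.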
In this research, two popular pretrained word embeddings, FastText and 
BPEmb, are used to represent sentences in order to overcome the drawbacks of the 
bag-of-words approach. The various word embedding models are compared against a 
bag-of-words baseline in this dissertation.
Additionally, in this research, a Myanmar news summarization system is also 
implemented with the Naïve Bayes approach. Naïve Bayes summarization includes 
three main parts: corpus creation, Naïve Bayes classifier training, and testing on new 
input documents. Supervised machine learning requires a data set of input 
observations, each associated with a correct label; therefore, the Myanmar news 
summarization corpus (input and summary pairs) is built first and divided into training 
and testing sets. In the second part of the system, a Naïve Bayes classifier is trained to 
classify each sentence as summary or non-summary. To train the classifier, three 
features are extracted from each sentence: TF-IDF (Term Frequency-Inverse 
Document Frequency), TF-ISF (Term Frequency-Inverse Sentence Frequency), and 
sentence length. After training, the model is applied to the testing corpus for 
prediction, and the sentences labeled "summary" are collected to form the generated 
summary. The experiments in this research show that the centroid method using the 
FastText word embedding model achieves better ROUGE (Recall-Oriented 
Understudy for Gisting Evaluation) scores than the Naïve Bayes method and the other 
centroid-based models.
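The supervised stage described above might be sketched as follows. The feature values are hypothetical, and scikit-learn's GaussianNB is used here only as a stand-in for whatever Naïve Bayes implementation the dissertation employs; each row holds the three features named above (mean TF-IDF, mean TF-ISF, normalized sentence length).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical feature vectors for training sentences:
# [mean TF-IDF, mean TF-ISF, normalized sentence length].
X_train = np.array([
    [0.80, 0.70, 0.90],   # informative, long sentence
    [0.75, 0.65, 0.80],   # informative, long sentence
    [0.10, 0.15, 0.20],   # short, uninformative sentence
    [0.20, 0.10, 0.30],   # short, uninformative sentence
])
y_train = ["summary", "non_summary", "summary", "non_summary"]
y_train = ["summary", "summary", "non_summary", "non_summary"]

# Train the classifier to separate summary from non-summary sentences.
clf = GaussianNB()
clf.fit(X_train, y_train)

# Predict labels for unseen test sentences; sentences predicted as
# "summary" would then be concatenated in document order to form
# the extractive summary.
X_test = np.array([
    [0.70, 0.60, 0.85],
    [0.15, 0.12, 0.25],
])
labels = clf.predict(X_test)
```

With these illustrative features the first test sentence falls near the "summary" class and the second near "non_summary", mirroring the prediction step of the pipeline.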