Abstract:
The Internet is made up of web pages, news stories, status updates, blogs, and
many other kinds of content, and the amount of information available on it is
ever-growing. Automatic text summarization is greatly needed to retrieve relevant
information and to consume it faster. The basic goal of text summarization is to
extract the most accurate and helpful information from a large document while
removing unnecessary or unimportant information.
Research on automatic text summarization for the Myanmar language is limited,
and there is no freely available Myanmar summarization corpus. Previous work on
Myanmar text summarization used a template-driven approach, a verb frame
resource, and query-based approaches.
The primary aim of this dissertation is to develop an automatic Myanmar news
summarization system based on centroid-based word embedding models and to compare
it with the Naïve Bayes method. Because there is no publicly available
summarization corpus in Myanmar NLP research, a Myanmar news summarization corpus
of news and summary pairs is also constructed in this dissertation.
The Myanmar language has very few linguistic resources. It is a low-resource
language, with few monolingual or multilingual corpora and few manually annotated
linguistic resources to support natural language processing (NLP) applications.
As a first step toward Myanmar news summarization, a corpus of Myanmar news
documents paired with summaries is created. One thousand Myanmar news articles
are collected from Myanmar news websites for this corpus. The summary for each
article is written manually by a human, at compression rates of 30% and 40%.
Creating manual (gold) summaries is time-consuming and requires a lot of human
effort.
In this dissertation, extractive Myanmar news summarization is implemented
using an unsupervised centroid-based method and a supervised Naïve Bayes method.
An extractive text summarization method usually includes a representation phase,
a scoring phase, and a ranking phase.
Most extractive summarization systems use the bag-of-words model as the
representation model, although it has limitations. Bag-of-words cannot capture
the semantic meaning of words: even when two sentences are strongly related, a
bag-of-words representation cannot capture their relationship if they share no
common word. To solve this issue, word embeddings are used as the representation
model for sentence scoring and ranking in this dissertation. The centroid-based
method identifies the most central sentences across the input documents, that is,
the sentences that contain the necessary and sufficient information related to
the main theme of the document.
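
The contrast between the two representations, and the centroid scoring idea, can
be illustrated with a minimal sketch. Everything in it is an assumption made for
illustration: the two-dimensional toy vectors stand in for pretrained embeddings,
the English tokens stand in for segmented Myanmar words, the centroid is taken as
the plain mean of the sentence vectors, and the function names are hypothetical.
It is not the exact procedure used in the dissertation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity of two sentences under a bag-of-words representation."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(ca) | set(cb))
    return cosine([ca[t] for t in vocab], [cb[t] for t in vocab])


def centroid_summary(sentences, embeddings, compression=0.3):
    """Score each sentence by similarity to the centroid and keep the top share.

    sentences: list of token lists; embeddings: dict mapping token -> vector.
    Returns (selected sentence indices in document order, all scores).
    """
    sent_vecs = []
    for tokens in sentences:
        known = [embeddings[t] for t in tokens if t in embeddings]
        sent_vecs.append(mean_vector(known) if known else None)
    usable = [v for v in sent_vecs if v is not None]
    if not usable:
        return [], [0.0] * len(sentences)
    centroid = mean_vector(usable)  # assumption: centroid = mean sentence vector
    scores = [cosine(v, centroid) if v is not None else 0.0 for v in sent_vecs]
    k = max(1, round(compression * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data (hypothetical): two weather sentences and one sports sentence.
toy_embeddings = {
    "flood": [1.0, 0.1], "rain": [0.9, 0.2],
    "storm": [0.95, 0.15], "river": [0.85, 0.1],
    "football": [0.0, 1.0], "match": [0.1, 0.9],
}
toy_sentences = [["flood", "rain"], ["storm", "river"], ["football", "match"]]

selected, scores = centroid_summary(toy_sentences, toy_embeddings, compression=0.3)
```

In this toy example the first two sentences share no word, so their bag-of-words
similarity is zero, while their embedding vectors are nearly parallel. The
off-topic third sentence receives the lowest centroid score and is left out of
the summary.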
In this research, two popular pretrained word embeddings, FastText and
BPEmb, are used to represent sentences in order to overcome the drawbacks of the
bag-of-words approach. The word embedding models are compared with each other and
with a baseline bag-of-words model in this dissertation.
Additionally, in this research, a Myanmar news summarization system is also
implemented with the Naïve Bayes approach. Naïve Bayes summarization includes
three main parts: corpus creation, Naïve Bayes classifier training, and testing
with new input documents. Supervised machine learning requires a data set of
input observations, each associated with a correct label. Therefore, the Myanmar
news summarization corpus (input and summary pairs) is built first and divided
into a training corpus and a testing corpus. In the second part of the system, a
Naïve Bayes classifier is trained to classify each sentence as a summary or
non_summary sentence. To train the classifier, three features are extracted from
each sentence: TF-IDF (Term Frequency-Inverse Document Frequency), TF-ISF (Term
Frequency-Inverse Sentence Frequency), and sentence length. After training, the
model is applied to the testing corpus for prediction. The sentences labeled
“summary” are collected to form the output summary. In the experiments of this
research, the centroid method using the FastText word embedding model gives
better ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores than both
the Naïve Bayes method and the other centroid-based variants.
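
The supervised pipeline described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the dissertation's
implementation: the abstract does not say which Naïve Bayes variant is used, so a
Gaussian variant is assumed here; the TF-ISF score is assumed to be averaged over
the sentence's tokens; the TF-IDF feature is omitted because it needs a document
collection; and the names tf_isf_scores, sentence_features, and
GaussianNaiveBayes are hypothetical.

```python
import math
from collections import Counter


def tf_isf_scores(sentences):
    """Mean TF-ISF per sentence, where ISF(t) = log(N / sentences containing t)."""
    n = len(sentences)
    sent_freq = Counter(t for s in sentences for t in set(s))
    scores = []
    for s in sentences:
        if not s:
            scores.append(0.0)
            continue
        tf = Counter(s)
        total = sum(tf[t] * math.log(n / sent_freq[t]) for t in tf)
        scores.append(total / len(s))  # assumption: average over tokens
    return scores


def sentence_features(sentences):
    """Feature vector per sentence: [mean TF-ISF, length in tokens]."""
    return [[score, float(len(s))]
            for score, s in zip(tf_isf_scores(sentences), sentences)]


class GaussianNaiveBayes:
    """Tiny Gaussian Naive Bayes: one normal distribution per class and feature."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            cols = list(zip(*rows))
            means = [sum(c) / len(rows) for c in cols]
            # Small constant keeps the variance strictly positive.
            variances = [sum((v - m) ** 2 for v in c) / len(rows) + 1e-6
                         for c, m in zip(cols, means)]
            self.stats[label] = (math.log(len(rows) / len(y)), means, variances)
        return self

    def _log_posterior(self, x, label):
        log_prior, means, variances = self.stats[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
        return log_prior + log_likelihood

    def predict(self, X):
        """Return the most probable label for each feature vector."""
        return [max(self.stats, key=lambda lab: self._log_posterior(x, lab))
                for x in X]


# Toy training data (hypothetical feature values and labels).
train_X = [[0.1, 3.0], [0.2, 4.0], [0.9, 10.0], [1.0, 12.0]]
train_y = ["non_summary", "non_summary", "summary", "summary"]
model = GaussianNaiveBayes().fit(train_X, train_y)

# Sentences predicted as "summary" would be collected, in order, as the output.
predicted = model.predict([[0.95, 11.0], [0.15, 3.5]])
```

In practice the feature vectors would come from sentence_features applied to the
segmented sentences of each news article, with labels taken from the manually
written summaries in the training corpus.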