Abstract:
The Internet is made up of web pages, news stories, status updates, blogs, and many 
other kinds of content, and the amount of information available online is ever-growing. 
Automatic text summarization is therefore greatly needed to find relevant information 
and consume it faster. The basic goal of text summarization is to extract the most 
accurate and helpful information from a large document while removing unnecessary 
or unimportant content.
Research on automatic text summarization for the Myanmar language is still limited, 
and there is no freely available Myanmar summarization corpus. Previous Myanmar 
text summarization work used a template-driven summarization approach, 
summarization based on a verb frame resource, and query-based text summarization 
approaches.
The primary aim of this dissertation is to develop an automatic Myanmar news 
summarization system based on centroid-based word embedding models, and to 
compare it against a Naïve Bayes method. Furthermore, because no summarization 
corpus is publicly available for Myanmar NLP research, this dissertation constructs a 
Myanmar news summarization corpus consisting of news and summary pairs.
Myanmar is a low-resource language, with few monolingual or multilingual corpora 
and few manually annotated linguistic resources to support natural language processing 
(NLP) applications. The first step of this work on Myanmar news summarization is 
therefore the creation of a corpus of Myanmar news document and summary pairs. One 
thousand Myanmar news articles are collected from Myanmar news websites for this 
corpus. A summary for each article is written manually at compression rates of 30% 
and 40%. Producing such manual (gold) summaries is time-consuming and requires 
considerable human effort.
In this dissertation, extractive Myanmar news summarization is implemented 
using an unsupervised centroid-based method and a supervised Naïve Bayes method. 
An extractive text summarization method typically consists of a representation phase, 
a scoring phase, and a ranking phase.
Most extractive summarization systems use the bag-of-words model for 
representation, although it has limitations: it cannot capture the semantic meaning of 
words. Even when two sentences are strongly related, a bag-of-words representation 
cannot capture their relationship if they share no common word. To address this issue, 
word embedding representations are used for sentence scoring and ranking in this 
dissertation. The centroid-based method identifies the most central sentences in a 
collection, i.e., those that contain the necessary and sufficient information related to 
the main theme of the document.
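The centroid-based scoring idea described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the word vectors are tiny hypothetical stand-ins for a pretrained model such as FastText, and sentences are represented as the average of their word embeddings.

```python
import numpy as np

# Hypothetical 3-d word vectors standing in for a pretrained embedding
# model such as FastText (a real model would provide ~300-d vectors).
word_vectors = {
    "economy": np.array([0.9, 0.1, 0.0]),
    "growth":  np.array([0.8, 0.2, 0.1]),
    "market":  np.array([0.7, 0.3, 0.0]),
    "weather": np.array([0.0, 0.1, 0.9]),
    "rain":    np.array([0.1, 0.0, 0.8]),
}

def embed_sentence(tokens):
    """Represent a sentence as the average of its word embeddings."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 for zero vectors)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

# A toy "document" of three tokenized sentences.
sentences = [
    ["economy", "growth", "market"],
    ["market", "growth"],
    ["weather", "rain"],
]

# Representation phase: embed each sentence, then take the document
# centroid as the mean of all sentence embeddings.
embeddings = [embed_sentence(s) for s in sentences]
centroid = np.mean(embeddings, axis=0)

# Scoring and ranking phases: score sentences by similarity to the
# centroid, rank them, and keep the top-k in original document order.
scores = [cosine(e, centroid) for e in embeddings]
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
summary_ids = sorted(ranked[:2])
```

Here the two "economy" sentences sit close to the centroid and are selected, while the off-topic "weather" sentence is ranked last, which is exactly the behaviour the centroid criterion aims for.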
In this research, two popular pretrained word embeddings, FastText and 
BPEmb, are used to represent sentences in order to overcome the drawbacks of the 
bag-of-words approach. The various word embedding models are compared against a 
bag-of-words baseline in this dissertation.
Additionally, in this research, a Myanmar news summarization system is also 
implemented with the Naïve Bayes approach. Naïve Bayes summarization includes 
three main parts: corpus creation, Naïve Bayes classifier training, and testing on new 
input documents. Supervised machine learning requires a data set of input 
observations, each associated with a correct label; therefore, the Myanmar news 
summarization corpus (input and summary pairs) is built first and divided into training 
and testing sets. In the second part of the system, a Naïve Bayes classifier is trained to 
classify each sentence as summary or non-summary. To train the classifier, three 
features are extracted from each sentence: TF-IDF (Term Frequency-Inverse 
Document Frequency), TF-ISF (Term Frequency-Inverse Sentence Frequency), and 
sentence length. After training, the model is applied to the testing corpus for 
prediction, and the sentences labeled "summary" are collected to form the generated 
summary. The experiments in this research show that the centroid method using the 
FastText word embedding model achieves better ROUGE (Recall-Oriented 
Understudy for Gisting Evaluation) scores than the Naïve Bayes method and the other 
centroid-based models.
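The supervised stage described above might be sketched as follows. The feature values are hypothetical, and scikit-learn's GaussianNB is used here only as a stand-in for whatever Naïve Bayes implementation the dissertation employs; each row holds the three features named above (mean TF-IDF, mean TF-ISF, normalized sentence length).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical feature vectors for training sentences:
# [mean TF-IDF, mean TF-ISF, normalized sentence length].
X_train = np.array([
    [0.80, 0.70, 0.90],   # informative, long sentence
    [0.75, 0.65, 0.80],   # informative, long sentence
    [0.10, 0.15, 0.20],   # short, uninformative sentence
    [0.20, 0.10, 0.30],   # short, uninformative sentence
])
y_train = ["summary", "non_summary", "summary", "non_summary"]
y_train = ["summary", "summary", "non_summary", "non_summary"]

# Train the classifier to separate summary from non-summary sentences.
clf = GaussianNB()
clf.fit(X_train, y_train)

# Predict labels for unseen test sentences; sentences predicted as
# "summary" would then be concatenated in document order to form
# the extractive summary.
X_test = np.array([
    [0.70, 0.60, 0.85],
    [0.15, 0.12, 0.25],
])
labels = clf.predict(X_test)
```

With these illustrative features the first test sentence falls near the "summary" class and the second near "non_summary", mirroring the prediction step of the pipeline.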