THE AUTOMATIC MYANMAR IMAGE CAPTIONING USING CNN AND BIDIRECTIONAL LSTM-BASED LANGUAGE MODEL

Aung, San Pa Pa

dc.contributor.author	Aung, San Pa Pa
dc.date.accessioned	2022-11-09T01:22:52Z
dc.date.available	2022-11-09T01:22:52Z
dc.date.issued	2022-11
dc.identifier.uri	https://onlineresource.ucsy.edu.mm/handle/123456789/2767
dc.description.abstract	Image captioning is one of the most challenging tasks in Artificial Intelligence, combining Computer Vision and Natural Language Processing (NLP). Computer Vision serves as the encoder, detecting salient objects or extracting image features, and NLP serves as the decoder, generating syntactically and semantically correct image captions. Describing the contents of an image without human intervention is a very complex task for a machine, and Computer Vision and NLP are widely used to tackle this problem. Although many image caption datasets such as Flickr8k, Flickr30k and MSCOCO are publicly available, most of them are captioned in English, and there is no image caption corpus for the Myanmar language. Therefore, a Myanmar image caption corpus was created, with over 50k sentences annotated for 10k images drawn from the Flickr8k dataset together with 2k images selected from the Flickr30k dataset. In this dissertation, to achieve better performance, two levels of segmentation, word and syllable, are studied in the text pre-processing step, and the effect of the segmentation level on the performance of the Myanmar image captioning system is investigated. The experimental results reveal that syllable-level segmentation gives significantly better performance for Myanmar image description than word-level segmentation. Additionally, this research constructed its own GloVe vectors for both segmented corpora. To the best of the author's knowledge, this is the first attempt to apply syllable and word vector features in a neural network-based Myanmar image captioning system and to compare them with one-hot encoding vectors across various models. Furthermore, the effect of applying GloVe vector features in the language model of an EfficientNetB7 and Bi-LSTM based image captioning system is investigated in this work.
According to the evaluation results, EfficientNetB7 with Bi-LSTM using word and syllable GloVe vector features outperforms EfficientNetB7 with Bi-LSTM using one-hot encoding, other state-of-the-art neural networks such as Gated Recurrent Unit (GRU), Bidirectional Gated Recurrent Unit (Bi-GRU), and Long Short-Term Memory (LSTM), the VGG16 with Bi-LSTM and NASNetLarge with Bi-LSTM models, as well as the baseline models. EfficientNetB7 with Bi-LSTM using GloVe vectors achieved its highest scores on word vectors of 35.09% BLEU-4, 49.52% ROUGE-L, 54.34% ROUGE-SU4 and 21.3% METEOR, and on syllable vectors of 46.2% BLEU-4, 65.62% ROUGE-L, 68.43% ROUGE-SU4 and 27.07% METEOR.	en_US
dc.language.iso	en	en_US
dc.publisher	University of Computer Studies, Yangon	en_US
dc.subject	Automatic Myanmar Image Captioning	en_US
dc.subject	CNN	en_US
dc.subject	Bidirectional LSTM-based Language Model	en_US
dc.title	THE AUTOMATIC MYANMAR IMAGE CAPTIONING USING CNN AND BIDIRECTIONAL LSTM-BASED LANGUAGE MODEL	en_US
dc.type	Thesis	en_US