UCSY's Research Repository

THE AUTOMATIC MYANMAR IMAGE CAPTIONING USING CNN AND BIDIRECTIONAL LSTM-BASED LANGUAGE MODEL

dc.contributor.author Aung, San Pa Pa
dc.date.accessioned 2022-11-09T01:22:52Z
dc.date.available 2022-11-09T01:22:52Z
dc.date.issued 2022-11
dc.identifier.uri https://onlineresource.ucsy.edu.mm/handle/123456789/2767
dc.description.abstract Image captioning is one of the most challenging tasks in Artificial Intelligence, combining Computer Vision and Natural Language Processing (NLP): Computer Vision serves as the encoder, detecting salient objects and extracting image features, while NLP serves as the decoder, generating syntactically and semantically correct image captions. Describing the contents of an image without human intervention is a very complex task for a machine, and Computer Vision and NLP are widely used together to tackle this problem. Although many image caption datasets such as Flickr8k, Flickr30k and MSCOCO are publicly available, most are captioned in English, and there is no image caption corpus for the Myanmar language. Therefore, a Myanmar image caption corpus was created by annotating over 50k sentences for 10k images, based on the Flickr8k dataset together with 2k images selected from the Flickr30k dataset. In this dissertation, to achieve better performance, two segmentation levels, word and syllable, are studied in the text pre-processing step, and the effect of the segmentation level on the performance of the Myanmar image captioning system is investigated. The experimental results reveal that syllable-level segmentation gives significantly better performance for Myanmar image description than word-level segmentation. Additionally, this research constructed its own GloVe vectors for both segmented corpora. To the best of the author's knowledge, this is the first attempt to apply syllable and word vector features in a neural network-based Myanmar image captioning system, and these features are compared with one-hot encoding vectors on various models. Furthermore, the effect of applying GloVe vector features in the language model of an EfficientNetB7 and Bi-LSTM based image captioning system is investigated in this work. According to the evaluation results, EfficientNetB7 with Bi-LSTM using word and syllable GloVe vectors outperforms EfficientNetB7 with Bi-LSTM using one-hot encoding, other state-of-the-art neural networks such as the Gated Recurrent Unit (GRU), Bidirectional Gated Recurrent Unit (Bi-GRU) and Long Short-Term Memory (LSTM), VGG16 with Bi-LSTM and NASNetLarge with Bi-LSTM models, as well as the baseline models. The EfficientNetB7 with Bi-LSTM model using GloVe vectors achieved the highest BLEU-4 score of 35.09%, ROUGE-L of 49.52%, ROUGE-SU4 of 54.34% and METEOR of 21.3% with word vectors, and the highest BLEU-4 score of 46.2%, ROUGE-L of 65.62%, ROUGE-SU4 of 68.43% and METEOR of 27.07% with syllable vectors. en_US
dc.language.iso en en_US
dc.publisher University of Computer Studies, Yangon en_US
dc.subject Automatic Myanmar Image Captioning en_US
dc.subject CNN en_US
dc.subject Bidirectional LSTM-based Language Model en_US
dc.title THE AUTOMATIC MYANMAR IMAGE CAPTIONING USING CNN AND BIDIRECTIONAL LSTM-BASED LANGUAGE MODEL en_US
dc.type Thesis en_US
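The abstract describes a merge-style encoder-decoder captioning model: a pretrained EfficientNetB7 CNN encodes the image, a bidirectional LSTM over GloVe-initialised word or syllable embeddings encodes the partial caption, and the two representations are combined to predict the next token. The following TensorFlow/Keras code is a minimal sketch of one plausible arrangement of those components; the layer sizes, vocabulary size, maximum caption length, merge strategy and the random placeholder embedding matrix are illustrative assumptions, not details taken from the thesis.

    # Minimal sketch of the EfficientNetB7 + Bi-LSTM captioning model outlined
    # in the abstract. Sizes and the merge strategy below are assumptions.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model
    from tensorflow.keras.applications import EfficientNetB7

    VOCAB_SIZE = 8000   # assumed syllable vocabulary size
    MAX_LEN = 30        # assumed maximum caption length (tokens)
    EMBED_DIM = 300     # assumed GloVe dimensionality
    UNITS = 256         # assumed LSTM width

    # Image encoder: frozen, ImageNet-pretrained EfficientNetB7 with global
    # average pooling; this model expects raw pixel values in [0, 255].
    cnn = EfficientNetB7(include_top=False, pooling="avg", weights="imagenet")
    cnn.trainable = False
    image_input = layers.Input(shape=(600, 600, 3), name="image")
    image_feats = layers.Dense(UNITS, activation="relu")(cnn(image_input))

    # Language model: embedding layer initialised from the (word- or
    # syllable-level) GloVe vectors, followed by a bidirectional LSTM.
    # A random matrix stands in here for the real GloVe embedding matrix.
    glove_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")
    caption_input = layers.Input(shape=(MAX_LEN,), name="caption_prefix")
    x = layers.Embedding(
        VOCAB_SIZE, EMBED_DIM, mask_zero=True,
        embeddings_initializer=tf.keras.initializers.Constant(glove_matrix),
        trainable=False)(caption_input)
    x = layers.Bidirectional(layers.LSTM(UNITS))(x)

    # Merge image and text representations and predict the next token.
    merged = layers.concatenate([image_feats, x])
    merged = layers.Dense(UNITS, activation="relu")(merged)
    next_token = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

    model = Model(inputs=[image_input, caption_input], outputs=next_token)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()

At inference time a caption would be generated token by token: the predicted word or syllable is appended to the caption prefix and fed back into the model until an end-of-sequence token is produced, and the generated captions can then be scored with BLEU, ROUGE and METEOR as reported in the abstract.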

