Abstract:
Image captioning is one of the most challenging tasks in Artificial Intelligence,
combining Computer Vision and Natural Language Processing (NLP): computer vision
detects salient objects or extracts image features as the encoder, and NLP generates
syntactically and semantically correct image captions as the decoder. Describing the
contents of an image is a very complex task for a machine without human intervention,
and Computer Vision and NLP are widely used together to tackle this problem. Although
many image caption datasets such as Flickr8k, Flickr30k and MSCOCO are publicly
available, most are captioned in English, and there is no image caption corpus for the
Myanmar language. Therefore, a Myanmar image caption corpus was created, with over
50k sentences annotated for 10k images, based on the Flickr8k dataset together with
2k images selected from the Flickr30k dataset.
In this dissertation, for the purpose of achieving better performance, two
segmentation levels, word and syllable, are studied in the text pre-processing
step. Furthermore, how the segmentation level affects the performance of the
Myanmar image captioning system is investigated. The experimental results reveal
that syllable-level segmentation gives significantly better performance for
Myanmar image description than word-level segmentation.
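To make the two granularities concrete, the following is a minimal sketch of word- versus syllable-level segmentation. It is not the segmenter used in this work; the syllable rule is a deliberately simplified one (break before any Myanmar consonant U+1000–U+1021 that is neither followed by the asat mark U+103A nor preceded by the stacking mark U+1039), and the word segmenter is a whitespace-split placeholder.

```python
import re

# Simplified syllable-boundary rule (an illustration, not the thesis segmenter):
# start a new syllable before a Myanmar consonant (U+1000-U+1021) unless it is
# followed by asat (U+103A) or preceded by the stacking mark (U+1039).
_SYLLABLE_BREAK = re.compile(r"(?<!\u1039)([\u1000-\u1021])(?!\u103A)")

def segment_syllables(text: str) -> list[str]:
    """Mark syllable boundaries, then return the list of syllables."""
    marked = _SYLLABLE_BREAK.sub(r"|\1", text)
    return [s for s in marked.split("|") if s]

def segment_words(text: str) -> list[str]:
    """Word-level placeholder: split on spaces only."""
    return text.split()

# Example: "မြန်မာ" (Myanmar) yields the two syllables "မြန်" and "မာ".
```

Under this rule a single space-delimited word decomposes into several syllable tokens, which is what makes the syllable-level vocabulary smaller and denser than the word-level one.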
Additionally, this research also constructed its own GloVe vectors for both
segmented corpora. To the best of our knowledge, this is the first attempt at
applying syllable and word vector features in a neural network-based Myanmar image
captioning system; these features are then compared with one-hot encoding vectors
on various models. Furthermore, the effect of applying GloVe vector features in
the language modelling of an EfficientNetB7 and Bi-LSTM based image captioning
system is investigated in this work.
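The standard way such trained GloVe vectors enter the language model is as the initial weights of the embedding layer. The sketch below illustrates this with toy vectors and a toy vocabulary (the actual dimensions, vocabulary, and vector values of the dissertation are not reproduced here); out-of-vocabulary tokens keep a zero row.

```python
import numpy as np

EMBED_DIM = 4  # toy dimension; the dissertation's GloVe dimension is not assumed

glove = {                       # token -> trained vector (toy values)
    "dog":  np.array([0.1, 0.2, 0.3, 0.4]),
    "runs": np.array([0.5, 0.6, 0.7, 0.8]),
}
vocab = {"<pad>": 0, "dog": 1, "runs": 2, "beach": 3}  # token -> index

def build_embedding_matrix(vocab, glove, dim):
    """One row per vocabulary index; tokens without a GloVe vector stay zero."""
    matrix = np.zeros((len(vocab), dim))
    for token, idx in vocab.items():
        vec = glove.get(token)
        if vec is not None:
            matrix[idx] = vec
    return matrix

matrix = build_embedding_matrix(vocab, glove, EMBED_DIM)
# `matrix` can then initialise an embedding layer, e.g. in Keras:
# Embedding(len(vocab), EMBED_DIM, weights=[matrix], ...)
```

Replacing this matrix with randomly initialised (or one-hot) rows is exactly the baseline the GloVe-based models are compared against.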
According to the evaluation results, EfficientNetB7 with Bi-LSTM using word
and syllable GloVe vector features outperforms EfficientNetB7 with Bi-LSTM
using one-hot encoding, other state-of-the-art neural networks such as the Gated
Recurrent Unit (GRU), Bidirectional Gated Recurrent Unit (Bi-GRU), and Long
Short-Term Memory (LSTM), the VGG16 with Bi-LSTM and NASNetLarge with Bi-LSTM
models, as well as the baseline models. EfficientNetB7 with Bi-LSTM using GloVe
vectors achieved the highest scores: with word vectors, a BLEU-4 score of 35.09%,
ROUGE-L of 49.52%, ROUGE-SU4 of 54.34% and METEOR of 21.3%; with syllable
vectors, a BLEU-4 score of 46.2%, ROUGE-L of 65.62%, ROUGE-SU4 of 68.43% and
METEOR of 27.07%.