Abstract:
The main objective of this paper is to improve automatic Myanmar image captioning by learning the contents of images with a NASNetLarge and Bi-LSTM model. Describing the contents of an image is a complex task for a machine without human intervention, and Computer Vision and Natural Language Processing are widely used to tackle this problem. This paper proposes a deep learning-based Myanmar image captioning system that uses the NASNetLarge CNN feature extraction model as an encoder and a deep Recurrent Neural Network (RNN) with Bi-directional Long Short-Term Memory (LSTM) as a decoder. For corpus construction, we created and annotated a Myanmar image caption corpus of over 40k Myanmar sentences, based on the Flickr8k dataset. Furthermore, two segmentation granularities, word-level and syllable-level, are studied in the text preprocessing step. In this work, the proposed Bi-directional LSTM model is compared with LSTM, GRU, and the baseline model. Experiments on the updated dataset show that all of our models using syllable segmentation achieve higher BLEU scores than those using word segmentation for the Myanmar image captioning system. The NASNetLarge with Bi-directional LSTM model using the syllable segmentation approach achieved the highest BLEU-4 score of 40.05%, which is 12.5% better than word segmentation in this work and 15.67% better than the BLEU-4 score of our previous work.
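To make the encoder-decoder structure named in the abstract concrete, the following is a minimal sketch in Keras/TensorFlow, assuming pre-extracted NASNetLarge features and a merge-style decoder; the vocabulary size, caption length, and layer dimensions are illustrative placeholders, not the paper's actual hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import NASNetLarge

VOCAB_SIZE = 5000   # placeholder: syllable- or word-level vocabulary size
MAX_LEN = 34        # placeholder: maximum caption length after segmentation
EMBED_DIM = 256     # placeholder: token embedding size

# Encoder: NASNetLarge pre-trained on ImageNet, used as a fixed feature extractor.
# With include_top=False and global average pooling it yields a 4032-dim vector per image.
cnn = NASNetLarge(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(331, 331, 3))
cnn.trainable = False

# Caption model operates on pre-extracted image features.
image_feat_input = layers.Input(shape=(4032,), name="image_feature")
img_feat = layers.Dense(512, activation="relu")(image_feat_input)

# Decoder: embedding + Bi-directional LSTM over the partial caption (segmented tokens).
caption_input = layers.Input(shape=(MAX_LEN,), name="partial_caption")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_input)
x = layers.Bidirectional(layers.LSTM(256))(x)   # 512-dim output (forward + backward)

# Merge image and text representations, then predict the next token.
merged = layers.add([img_feat, x])
merged = layers.Dense(256, activation="relu")(merged)
output = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[image_feat_input, caption_input], outputs=output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

In this sketch, image features would be computed once with `cnn.predict` on resized images, and captions are generated at inference time by feeding the image feature together with the tokens predicted so far, one step at a time.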