dc.description.abstract |
The goal of text-to-image synthesis is to automatically create an image that
corresponds to a given text description: a computer model is trained to understand
natural language and translate it into visual representations. A key challenge of
text-to-image synthesis is the semantic gap between natural language and visual
representations. Natural language processing and computer vision techniques can be
used together to bridge this gap: natural language processing extracts meaningful
information from the textual input, while computer vision generates an image that
corresponds to it. Mapping textual input to visual representations in this way
helps to generate more accurate and meaningful images.
Text-to-image synthesis has gained popularity in recent years owing to advances
in deep learning. It has become an active research area in artificial intelligence
and has attracted researchers, practitioners, and the general public. However,
text-to-image synthesis for Myanmar is a challenging research problem because
several factors make generating images from textual descriptions difficult. One
such challenge is the scarcity of large-scale annotated datasets of Myanmar textual
descriptions and corresponding images. Therefore, a Myanmar caption corpus is
manually built based on the Oxford-102 Flowers dataset in order to train a Myanmar
text-to-image synthesis (T2I) model.
In this dissertation, a Myanmar T2I model is proposed using Generative
Adversarial Networks (GANs). Firstly, a DCGAN-based Myanmar T2I model is proposed
to create images from Myanmar text descriptions. However, this model can only
generate low-resolution images (64 x 64). For this reason, AttnGAN and DF-GAN are
investigated to determine which model can generate high-resolution images
(256 x 256) with semantic accuracy from Myanmar text descriptions. In this
comparison, DF-GAN gives the better result for Myanmar T2I.
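To make the conditioning mechanism concrete, the following is a minimal sketch of
a DCGAN-style text-conditional generator (in PyTorch, with hypothetical layer sizes
rather than the exact architecture used in this work): a sentence embedding from a
text encoder is concatenated with a noise vector and upsampled to a 64 x 64 image.

import torch
import torch.nn as nn

class TextConditionalGenerator(nn.Module):
    # DCGAN-style generator conditioned on a sentence embedding.
    # Sizes are illustrative: 100-d noise, 256-d text embedding, 64 x 64 RGB output.
    def __init__(self, noise_dim=100, text_dim=256, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            # (noise_dim + text_dim) x 1 x 1 -> (ngf*8) x 4 x 4
            nn.ConvTranspose2d(noise_dim + text_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            # -> (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            # -> (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            # -> ngf x 32 x 32
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            # -> 3 x 64 x 64, pixel values in [-1, 1]
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, noise, text_emb):
        # Concatenate noise and caption embedding, treat it as a 1x1 feature map.
        z = torch.cat([noise, text_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

AttnGAN and DF-GAN follow the same conditioning idea but reach 256 x 256, AttnGAN
by attending over word-level embeddings at multiple stages and DF-GAN by fusing the
sentence embedding into the generator through deep fusion blocks.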
Moreover, DF-GAN+MSM (multimodal similarity model) is proposed in order to
generate semantically consistent images with precise shapes for the Myanmar
language, because the images generated by DF-GAN contain artifacts that need to be
improved. In DF-GAN+MSM, DF-GAN is applied to generate images from Myanmar text
descriptions, and the multimodal similarity model is used to evaluate the matching
score between the Myanmar text and the generated images during training. The
similarity model contains two networks: a text encoder and an image encoder.
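As a rough sketch of this matching score (assuming both encoders project their
inputs into a shared embedding space; the encoder architectures themselves are
omitted here), the score can be taken as the cosine similarity between the caption
embedding and the embedding of the generated image, and a loss derived from it can
penalize text-image mismatch during training:

import torch.nn.functional as F

def matching_score(text_emb, image_emb):
    # text_emb, image_emb: (batch, dim) outputs of the text encoder and
    # image encoder in a shared space (dimensions hypothetical).
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    return (text_emb * image_emb).sum(dim=-1)  # cosine similarity in [-1, 1]

def matching_loss(text_emb, image_emb):
    # Push each generated image toward the embedding of its own caption.
    return (1.0 - matching_score(text_emb, image_emb)).mean()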
The performance of the Myanmar T2I models is evaluated in two ways, quantitative
analysis and qualitative analysis, to assess the quality of the generated images.
In the quantitative analysis, DF-GAN+MSM obtains the highest Inception Scores and
the lowest FID scores on the generated images. In addition, DF-GAN+MSM obtains the
highest preference scores in the qualitative evaluation based on human perception.
Moreover, the proposed model is also implemented on the Caltech-UCSD Birds (CUB)
dataset, which is annotated in English, to show that the model also improves the
quality of synthesized images for a different language and dataset.
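For reference, the FID used in the quantitative analysis compares the statistics
of features extracted by a pretrained Inception network from real and generated
images. A standard sketch of the computation (the feature-extraction step is
omitted here) is:

import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    # real_feats, fake_feats: (n, d) arrays of Inception activations for
    # real and generated images; lower FID means the generated image
    # distribution is closer to the real one.
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(sigma_r + sigma_f - 2 * covmean))

The Inception Score is computed from the class-probability outputs of the same
network, where higher scores indicate sharper and more diverse images. |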
en_US |