Generating image captions through multimodal embedding

Sandeep Kumar Dash, Saurav Saha, Partha Pakray, Alexander Gelbukh

Research output: Contribution to journal › Article › peer-review


Abstract

Caption generation requires the best of both Computer Vision and Natural Language Processing, and recent advances in both fields have produced many efficient models. Automatic image captioning can be used to describe website content or to generate frame-by-frame descriptions of video for the vision-impaired, among many other applications. In this work, a model is described that generates novel captions for previously unseen images using a multimodal architecture that combines a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN). The model is trained on the Microsoft Common Objects in Context (MSCOCO) image captioning dataset and aligns captions and images in a shared representation space, so that an image lies close to its relevant captions and far from dissimilar captions and dissimilar images. The ResNet-50 architecture is used to extract image features, while GloVe embeddings together with a Gated Recurrent Unit (GRU) in the RNN provide the text representation. The MSCOCO evaluation server is used to evaluate the machine-generated captions.
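The abstract does not give implementation details, but the following PyTorch sketch illustrates the kind of multimodal embedding it describes: a ResNet-50 image encoder and a GloVe-initialized GRU caption encoder are projected into a shared space, and a pairwise ranking loss pulls matching image-caption pairs together. The vocabulary size, embedding dimensions, margin, and loss formulation are illustrative assumptions, not values or choices taken from the paper.

```python
# Minimal sketch of a joint image-caption embedding (assumed details, not the
# authors' exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

EMBED_DIM = 300      # assumed GloVe dimensionality
JOINT_DIM = 512      # assumed shared embedding size
VOCAB_SIZE = 10000   # assumed vocabulary size

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50()                        # ResNet-50 feature extractor
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(2048, JOINT_DIM)       # project to the joint space

    def forward(self, images):
        x = self.features(images).flatten(1)         # (B, 2048) pooled features
        return F.normalize(self.proj(x), dim=-1)     # unit-norm joint embedding

class CaptionEncoder(nn.Module):
    def __init__(self, glove_weights=None):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        if glove_weights is not None:                # initialize from GloVe if provided
            self.embed.weight.data.copy_(glove_weights)
        self.gru = nn.GRU(EMBED_DIM, JOINT_DIM, batch_first=True)

    def forward(self, token_ids):
        _, h = self.gru(self.embed(token_ids))       # final GRU hidden state
        return F.normalize(h.squeeze(0), dim=-1)

def ranking_loss(img_emb, cap_emb, margin=0.2):
    """Hinge-based ranking loss over in-batch negatives (assumed objective)."""
    scores = img_emb @ cap_emb.t()                   # cosine similarities
    pos = scores.diag().unsqueeze(1)
    cost_cap = (margin + scores - pos).clamp(min=0)      # wrong captions for an image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # wrong images for a caption
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_cap.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

# Example forward pass on dummy data
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, VOCAB_SIZE, (4, 12))
loss = ranking_loss(ImageEncoder()(images), CaptionEncoder()(captions))
```

In this sketch, unit-normalizing both encoders makes the dot product a cosine similarity, so "close in the representation space" corresponds directly to a high similarity score between an image and its relevant captions.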

Original language: English
Pages (from-to): 4787-4796
Number of pages: 10
Journal: Journal of Intelligent and Fuzzy Systems
Volume: 36
Issue number: 5
DOIs
State: Published - 2019

Keywords

  • Convolutional neural network
  • Image captioning

