Generating image captions through multimodal embedding

Sandeep Kumar Dash, Saurav Saha, Partha Pakray, Alexander Gelbukh

Research output: Contribution to journal › Article › peer-review


Abstract

Caption generation requires the best of both Computer Vision and Natural Language Processing, and recent advances in both fields have produced many efficient models. Automatic image captioning can be used to describe website content or to generate frame-by-frame descriptions of video for the vision-impaired, among many other applications. In this work, a model is described that generates novel captions for previously unseen images using a multimodal architecture that combines a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN). The model is trained on the Microsoft Common Objects in Context (MSCOCO) image captioning dataset and aligns captions and images in a shared representation space, so that an image lies close to its relevant captions and far from dissimilar captions and dissimilar images. The ResNet-50 architecture is used to extract image features, while GloVe embeddings together with a Gated Recurrent Unit (GRU) in the RNN provide the text representation. The MSCOCO evaluation server is used to evaluate the machine-generated captions.
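The abstract does not give implementation details, but the following PyTorch sketch illustrates the kind of multimodal embedding it describes: a ResNet-50 image encoder and a GloVe-initialized GRU caption encoder are projected into a shared space, and a pairwise ranking loss pulls matching image-caption pairs together. The vocabulary size, embedding dimensions, margin, and loss formulation are illustrative assumptions, not values or choices taken from the paper.

```python
# Minimal sketch of a joint image-caption embedding (assumed details, not the
# authors' exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

EMBED_DIM = 300      # assumed GloVe dimensionality
JOINT_DIM = 512      # assumed shared embedding size
VOCAB_SIZE = 10000   # assumed vocabulary size

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50()                        # ResNet-50 feature extractor
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(2048, JOINT_DIM)       # project to the joint space

    def forward(self, images):
        x = self.features(images).flatten(1)         # (B, 2048) pooled features
        return F.normalize(self.proj(x), dim=-1)     # unit-norm joint embedding

class CaptionEncoder(nn.Module):
    def __init__(self, glove_weights=None):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        if glove_weights is not None:                # initialize from GloVe if provided
            self.embed.weight.data.copy_(glove_weights)
        self.gru = nn.GRU(EMBED_DIM, JOINT_DIM, batch_first=True)

    def forward(self, token_ids):
        _, h = self.gru(self.embed(token_ids))       # final GRU hidden state
        return F.normalize(h.squeeze(0), dim=-1)

def ranking_loss(img_emb, cap_emb, margin=0.2):
    """Hinge-based ranking loss over in-batch negatives (assumed objective)."""
    scores = img_emb @ cap_emb.t()                   # cosine similarities
    pos = scores.diag().unsqueeze(1)
    cost_cap = (margin + scores - pos).clamp(min=0)      # wrong captions for an image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # wrong images for a caption
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_cap.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

# Example forward pass on dummy data
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, VOCAB_SIZE, (4, 12))
loss = ranking_loss(ImageEncoder()(images), CaptionEncoder()(captions))
```

In this sketch, unit-normalizing both encoders makes the dot product a cosine similarity, so "close in the representation space" corresponds directly to a high similarity score between an image and its relevant captions.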

Original language: English
Pages (from-to): 4787-4796
Number of pages: 10
Journal: Journal of Intelligent and Fuzzy Systems
Volume: 36
Issue number: 5
DOIs
State: Published - 2019

Keywords

  • Convolutional neural network
  • Image captioning

