TY - GEN
T1 - Semi-automatic annotation with predicted visual saliency maps for object recognition in wearable video
AU - Benois-Pineau, J.
AU - García Vázquez, M. S.
AU - Oropesa Morales, L. A.
AU - Ramirez Acosta, A. A.
N1 - Publisher Copyright:
© 2017 Association for Computing Machinery.
PY - 2017/6/6
Y1 - 2017/6/6
N2 - Recognition of objects of a given category in visual content is one of the key problems in computer vision and multimedia. It is strongly needed in wearable video shooting for a wide range of important applications in society. Supervised learning approaches have proved to be the most efficient for this task. They require available ground truth for training models. This is especially true for Deep Convolutional Networks, but it also holds for other popular models such as SVMs on visual signatures. Annotating ground truth by drawing bounding boxes (BB) is a very tedious task requiring substantial human effort. Research on the prediction of visual attention in images and videos has reached maturity, particularly for bottom-up visual attention modeling. Hence, instead of annotating the ground truth manually with BBs, we propose to use automatically predicted salient areas as object locators for annotation. Such saliency prediction is not perfect, however. Therefore, active contour models on saliency maps are used to isolate the most prominent areas covering the objects. The approach is tested in the framework of a well-studied supervised learning model: an SVM with a psycho-visually weighted Bag-of-Words. The egocentric GTEA dataset was used in the experiments. The difference in mAP (mean average precision) is less than 10 percent, while the mean annotation time is 36% lower.
AB - Recognition of objects of a given category in visual content is one of the key problems in computer vision and multimedia. It is strongly needed in wearable video shooting for a wide range of important applications in society. Supervised learning approaches have proved to be the most efficient for this task. They require available ground truth for training models. This is especially true for Deep Convolutional Networks, but it also holds for other popular models such as SVMs on visual signatures. Annotating ground truth by drawing bounding boxes (BB) is a very tedious task requiring substantial human effort. Research on the prediction of visual attention in images and videos has reached maturity, particularly for bottom-up visual attention modeling. Hence, instead of annotating the ground truth manually with BBs, we propose to use automatically predicted salient areas as object locators for annotation. Such saliency prediction is not perfect, however. Therefore, active contour models on saliency maps are used to isolate the most prominent areas covering the objects. The approach is tested in the framework of a well-studied supervised learning model: an SVM with a psycho-visually weighted Bag-of-Words. The egocentric GTEA dataset was used in the experiments. The difference in mAP (mean average precision) is less than 10 percent, while the mean annotation time is 36% lower.
KW - Active contour
KW - Object recognition
KW - Saliency maps
KW - Visual object annotation
UR - http://www.scopus.com/inward/record.url?scp=85025585592&partnerID=8YFLogxK
U2 - 10.1145/3080538.3080541
DO - 10.1145/3080538.3080541
M3 - Conference contribution
AN - SCOPUS:85025585592
T3 - WearMMe 2017 - Proceedings of the 2017 Workshop on Wearable MultiMedia, co-located with ICMR 2017
SP - 10
EP - 14
BT - WearMMe 2017 - Proceedings of the 2017 Workshop on Wearable MultiMedia, co-located with ICMR 2017
PB - Association for Computing Machinery, Inc
T2 - 2017 Workshop on Wearable Multimedia, WearMMe 2017
Y2 - 6 June 2017
ER -