Semi-automatic annotation with predicted visual saliency maps for object recognition in wearable video

J. Benois-Pineau, M. S. García Vázquez, L. A. Oropesa Morales, A. A. Ramirez Acosta

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Recognition of objects of a given category in visual content is one of the key problems in computer vision and multimedia. It is strongly needed in wearable video shooting for a wide range of important applications in society. Supervised learning approaches have proved to be the most efficient for this task. They require ground truth for training models. This is specifically true for Deep Convolutional Networks, but also holds for other popular models such as SVMs on visual signatures. Annotating ground truth by drawing bounding boxes (BB) is a very tedious task requiring significant human resources. Research on predicting visual attention in images and videos has reached maturity, particularly with respect to bottom-up visual attention modeling. Hence, instead of annotating the ground truth manually with BB, we propose to use automatically predicted salient areas as object locators for annotation. Such saliency prediction is not perfect, however. Hence, active contour models on saliency maps are used to isolate the most prominent areas covering the objects. The approach is tested within a well-studied supervised learning framework: an SVM with psycho-visually weighted Bag-of-Words features. The egocentric GTEA dataset was used in the experiments. The difference in mAP (mean average precision) is less than 10%, while the mean annotation time is 36% lower.
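The core idea of the abstract (predicted salient areas serving as object locators instead of manual bounding boxes) can be sketched in simplified form: threshold a saliency map and take the bounding box of the salient region. This is only an illustrative reduction; the paper's actual method refines the salient region with active contour models, which are omitted here, and the function name and threshold fraction below are assumptions, not taken from the paper.

```python
import numpy as np

def saliency_to_bbox(saliency, frac=0.5):
    """Return (x_min, y_min, x_max, y_max) of the region whose saliency
    exceeds frac * max saliency, or None if the map is empty.

    Illustrative stand-in for the paper's active-contour refinement:
    here the salient region is isolated by simple thresholding.
    """
    if saliency.max() <= 0:
        return None
    mask = saliency >= frac * saliency.max()  # binary salient region
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Synthetic saliency map with one bright blob at rows 10-19, cols 30-44.
sal = np.zeros((64, 64))
sal[10:20, 30:45] = 1.0
print(saliency_to_bbox(sal))  # (30, 10, 44, 19)
```

A predicted box like this would then be used directly as the annotation for training the SVM, replacing a hand-drawn bounding box.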

Original language: English
Host publication title: WearMMe 2017 - Proceedings of the 2017 Workshop on Wearable MultiMedia, co-located with ICMR 2017
Publisher: Association for Computing Machinery, Inc
Pages: 10-14
Number of pages: 5
ISBN (electronic): 9781450350334
DOI
State: Published - 6 Jun 2017
Event: 2017 Workshop on Wearable Multimedia, WearMMe 2017 - Bucharest, Romania
Duration: 6 Jun 2017 → …

Publication series

Name: WearMMe 2017 - Proceedings of the 2017 Workshop on Wearable MultiMedia, co-located with ICMR 2017

Conference

Conference: 2017 Workshop on Wearable Multimedia, WearMMe 2017
Country/Territory: Romania
City: Bucharest
Period: 6/06/17 → …
