Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices

E. A. Escoto Sotelo; Tomoaki Nakamura; Takayuki Nagai; E. Escamilla Hernandez

doi:10.1109/SITIS.2012.154

Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME), Unidad Culhuacán

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

This paper proposes a method for matching an unfamiliar person's face to an unfamiliar voice. The idea is motivated by human crossmodal perception, which includes many illusions such as the McGurk effect and the ventriloquist illusion. In particular, we focus on recent psychological evidence suggesting that humans can match unfamiliar faces to unfamiliar voices to some extent. The aim of this paper is to mimic this ability on a computer. To match an unfamiliar person's face to an unfamiliar voice, a dataset of paired facial images and corresponding voices is used as knowledge. That is, the unfamiliar voice is matched to the closest known speaker model. Since the database contains the corresponding facial image, the system can estimate the closest known face from the unfamiliar voice. Finally, each unfamiliar face is matched to the estimated known face, and the final recognition result is obtained. To this end, we first implement a speaker recognition system that uses Mel-frequency cepstral coefficients as the speech feature and Gaussian mixture models as the classifier. We also use a two-dimensional HMM-based face recognizer and propose a statistical integration of the audio and visual recognition results. To demonstrate the feasibility of the proposed system, unfamiliar speaker recognition experiments are carried out using 60 of the ATR-503 sentences uttered by 20 university students.
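The matching chain described in the abstract (unfamiliar voice, then closest known speaker, then that speaker's known face, then the best-matching unfamiliar face) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: a single diagonal-covariance Gaussian per speaker stands in for the Gaussian mixture models, toy 2-D vectors stand in for MFCC frames, a hypothetical `face_similarity` table stands in for the HMM-based face recognizer, and a hard decision replaces the proposed statistical integration. The names `fit_diag_gaussian`, `avg_log_likelihood`, `closest_known_speaker`, and `infer_face` are likewise hypothetical.

```python
import math


def fit_diag_gaussian(frames):
    """Fit one diagonal-covariance Gaussian to a list of feature vectors.

    Assumption: a single Gaussian replaces the GMM named in the abstract.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so the log-likelihood stays finite.
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
        for d in range(dim)
    ]
    return mean, var


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of the frames under the model."""
    mean, var = model
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def closest_known_speaker(voice_frames, speaker_models):
    """Return the known speaker whose model best explains the voice."""
    return max(
        speaker_models,
        key=lambda name: avg_log_likelihood(voice_frames, speaker_models[name]),
    )


def infer_face(voice_frames, speaker_models, face_similarity):
    """Pick the unfamiliar face most similar to the estimated known face.

    face_similarity[candidate][known] is a hypothetical similarity score
    between an unfamiliar candidate face and a known person's face.
    """
    known = closest_known_speaker(voice_frames, speaker_models)
    best = max(face_similarity, key=lambda c: face_similarity[c][known])
    return known, best


# Toy knowledge base: voice frames for two known speakers.
known_voices = {
    "A": [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]],
    "B": [[3.0, 2.9], [3.2, 3.1], [2.9, 3.0]],
}
speaker_models = {k: fit_diag_gaussian(v) for k, v in known_voices.items()}

# Unfamiliar voice plus two unfamiliar candidate faces.
unfamiliar_voice = [[2.8, 3.2], [3.1, 2.7]]
face_similarity = {
    "face_1": {"A": 0.9, "B": 0.2},
    "face_2": {"A": 0.1, "B": 0.7},
}

result = infer_face(unfamiliar_voice, speaker_models, face_similarity)
print(result)
```

A soft variant would weight every known speaker by its voice likelihood instead of committing to the single closest one; whether that resembles the statistical integration proposed in the paper cannot be determined from the abstract alone.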

Original language	English
Title of host publication	8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
Pages	97-104
Number of pages	8
DOI	https://doi.org/10.1109/SITIS.2012.154
State	Published - 2012
Event	8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012 - Sorrento, Italy. Duration: 25 Nov 2012 → 29 Nov 2012

Publication series

Name	8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012r

Conference

Conference	8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
Country/Territory	Italy
City	Sorrento
Period	25/11/12 → 29/11/12

Access to document

10.1109/SITIS.2012.154

Other files and links

Link to publication in Scopus

Cite this

Escoto Sotelo, E. A., Nakamura, T., Nagai, T., & Escamilla Hernandez, E. (2012). Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices. In 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012 (pp. 97-104). Article 6395080 (8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012r). https://doi.org/10.1109/SITIS.2012.154

Escoto Sotelo, E. A. ; Nakamura, Tomoaki ; Nagai, Takayuki et al. / Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices. 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012. 2012. pp. 97-104 (8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012r).

@inproceedings{658ada2c513847aeb1de6cde3a1224ae,

title = "Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices",

abstract = "This paper proposes a method for matching an unfamiliar person's face to an unfamiliar voice. The idea is motivated by human crossmodal perception, which includes many illusions such as the McGurk effect and the ventriloquist illusion. In particular, we focus on recent psychological evidence suggesting that humans can match unfamiliar faces to unfamiliar voices to some extent. The aim of this paper is to mimic this ability on a computer. To match an unfamiliar person's face to an unfamiliar voice, a dataset of paired facial images and corresponding voices is used as knowledge. That is, the unfamiliar voice is matched to the closest known speaker model. Since the database contains the corresponding facial image, the system can estimate the closest known face from the unfamiliar voice. Finally, each unfamiliar face is matched to the estimated known face, and the final recognition result is obtained. To this end, we first implement a speaker recognition system that uses Mel-frequency cepstral coefficients as the speech feature and Gaussian mixture models as the classifier. We also use a two-dimensional HMM-based face recognizer and propose a statistical integration of the audio and visual recognition results. To demonstrate the feasibility of the proposed system, unfamiliar speaker recognition experiments are carried out using 60 of the ATR-503 sentences uttered by 20 university students.",

keywords = "Audio-visual integration cross-modal matching identity system, EM algorithm, Gaussian mixture models, Pseudo 2-D Hidden Markov model",

author = "{Escoto Sotelo}, {E. A.} and Tomoaki Nakamura and Takayuki Nagai and {Escamilla Hernandez}, E.",

year = "2012",

doi = "10.1109/SITIS.2012.154",

language = "English",

isbn = "9780769549118",

series = "8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012r",

pages = "97--104",

booktitle = "8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012",

note = "8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012 ; Conference date: 25-11-2012 Through 29-11-2012",

}

Escoto Sotelo, EA, Nakamura, T, Nagai, T & Escamilla Hernandez, E 2012, Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices. In 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012., 6395080, 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012r, pp. 97-104, 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012, Sorrento, Italy, 25/11/12. https://doi.org/10.1109/SITIS.2012.154

Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices. / Escoto Sotelo, E. A.; Nakamura, Tomoaki; Nagai, Takayuki et al.
8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012. 2012. p. 97-104 6395080 (8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012r).

TY - GEN

T1 - Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices

AU - Escoto Sotelo, E. A.

AU - Nakamura, Tomoaki

AU - Nagai, Takayuki

AU - Escamilla Hernandez, E.

PY - 2012

Y1 - 2012

AB - This paper proposes a method for matching an unfamiliar person's face to an unfamiliar voice. The idea is motivated by human crossmodal perception, which includes many illusions such as the McGurk effect and the ventriloquist illusion. In particular, we focus on recent psychological evidence suggesting that humans can match unfamiliar faces to unfamiliar voices to some extent. The aim of this paper is to mimic this ability on a computer. To match an unfamiliar person's face to an unfamiliar voice, a dataset of paired facial images and corresponding voices is used as knowledge. That is, the unfamiliar voice is matched to the closest known speaker model. Since the database contains the corresponding facial image, the system can estimate the closest known face from the unfamiliar voice. Finally, each unfamiliar face is matched to the estimated known face, and the final recognition result is obtained. To this end, we first implement a speaker recognition system that uses Mel-frequency cepstral coefficients as the speech feature and Gaussian mixture models as the classifier. We also use a two-dimensional HMM-based face recognizer and propose a statistical integration of the audio and visual recognition results. To demonstrate the feasibility of the proposed system, unfamiliar speaker recognition experiments are carried out using 60 of the ATR-503 sentences uttered by 20 university students.

KW - Audio-visual integration cross-modal matching identity system

KW - EM algorithm

KW - Gaussian mixture models

KW - Pseudo 2-D Hidden Markov model

UR - http://www.scopus.com/inward/record.url?scp=84874067299&partnerID=8YFLogxK

U2 - 10.1109/SITIS.2012.154

DO - 10.1109/SITIS.2012.154

M3 - Conference contribution

AN - SCOPUS:84874067299

SN - 9780769549118

T3 - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012r

SP - 97

EP - 104

BT - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012

T2 - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012

Y2 - 25 November 2012 through 29 November 2012

ER -

Escoto Sotelo EA, Nakamura T, Nagai T, Escamilla Hernandez E. Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices. In 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012. 2012. p. 97-104. 6395080. (8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012r). doi: 10.1109/SITIS.2012.154
