TY - GEN
T1 - Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices
AU - Escoto Sotelo, E. A.
AU - Nakamura, Tomoaki
AU - Nagai, Takayuki
AU - Escamilla Hernandez, E.
PY - 2012
Y1 - 2012
N2 - This paper proposes a method for matching an unfamiliar person's face to an unfamiliar voice. The idea is motivated by human crossmodal perception, which gives rise to illusions such as the McGurk effect and the ventriloquist illusion. In particular, we focus on recent psychological evidence suggesting that humans can match unfamiliar faces to unfamiliar voices to some extent. The aim of this paper is to reproduce this ability on a computer. To realize the matching of an unfamiliar face to an unfamiliar voice, a dataset of pairs of facial images and corresponding voices is used as prior knowledge: the unfamiliar voice is matched to the closest known speaker model, and since the database contains the corresponding facial image, the system can estimate the closest known face from the unfamiliar voice. Finally, each unfamiliar face is matched against the estimated known face, and the final recognition result is obtained. To this end, we first implement a speaker recognition system that uses Mel-frequency cepstral coefficients (MFCCs) as the speech feature and Gaussian mixture models as the classifier. We also use a two-dimensional HMM-based face recognizer and propose a statistical integration of the audio and visual recognition results. To show the feasibility of the proposed system, unfamiliar speaker recognition experiments are carried out using 60 sentences from the ATR-503 sentence set uttered by 20 university students.
KW - Audio-visual integration cross-modal matching identity system
KW - EM algorithm
KW - Gaussian mixture models
KW - Pseudo 2-D Hidden Markov model
UR - http://www.scopus.com/inward/record.url?scp=84874067299&partnerID=8YFLogxK
U2 - 10.1109/SITIS.2012.154
DO - 10.1109/SITIS.2012.154
M3 - Conference contribution
AN - SCOPUS:84874067299
SN - 9780769549118
T3 - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
SP - 97
EP - 104
BT - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
T2 - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
Y2 - 25 November 2012 through 29 November 2012
ER -