TY - GEN
T1 - Who said that? The crossmodal matching identity for inferring unfamiliar faces from voices
AU - Escoto Sotelo, E. A.
AU - Nakamura, Tomoaki
AU - Nagai, Takayuki
AU - Escamilla Hernandez, E.
PY - 2012
Y1 - 2012
N2 - This paper proposes a method for matching an unfamiliar person's face to an unfamiliar voice. The idea is motivated by human crossmodal perception, which gives rise to illusions such as the McGurk effect and the ventriloquist illusion. In particular, we focus on recent psychological evidence suggesting that humans can match unfamiliar faces to unfamiliar voices to some extent. The aim of this paper is to reproduce this ability on a computer. To realize the matching of an unfamiliar face to an unfamiliar voice, a dataset of pairs of facial images and corresponding voices is used as prior knowledge: the unfamiliar voice is matched to the closest known speaker model, and since the database contains the corresponding facial image, the system can estimate the closest known face from the unfamiliar voice. Finally, each unfamiliar face is matched against the estimated known face, and the final recognition result is obtained. To this end, we first implement a speaker recognition system that uses Mel-frequency cepstral coefficients (MFCCs) as the speech feature and Gaussian mixture models as the classifier. We also use a two-dimensional HMM-based face recognizer and propose a statistical integration of the audio and visual recognition results. To show the feasibility of the proposed system, unfamiliar speaker recognition experiments are carried out using 60 sentences from the ATR-503 sentence set uttered by 20 university students.
KW - Audio-visual integration cross-modal matching identity system
KW - EM algorithm
KW - Gaussian mixture models
KW - Pseudo 2-D Hidden Markov model
UR - http://www.scopus.com/inward/record.url?scp=84874067299&partnerID=8YFLogxK
U2 - 10.1109/SITIS.2012.154
DO - 10.1109/SITIS.2012.154
M3 - Conference contribution
AN - SCOPUS:84874067299
SN - 9780769549118
T3 - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
SP - 97
EP - 104
BT - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
T2 - 8th International Conference on Signal Image Technology and Internet Based Systems, SITIS 2012
Y2 - 25 November 2012 through 29 November 2012
ER -