Very deep convolutional neural network for speech recognition based on words

Javier O. Pinzon; Robinson Jimenez-Moreno; Oscar Aviles; Paola Nino; Diana Ovalle

doi:10.3923/jeasci.2018.6680.6685

Research output: Contribution to journal › Article › peer-review

Abstract

This study presents the implementation of two very deep convolutional neural network architectures applied to speech recognition based on complete words (in this case, 12 specific words) in order to evaluate their performance in two types of environments, one semicontrolled and the other non-controlled. One of the architectures uses linear filters only in frequency, while the other uses linear filters in both frequency and time. The power spectral density, together with its first and second derivatives, is proposed as the network input in order to broaden the variety of feature maps that can be used in neural networks for speech recognition. Finally, in tests performed in real time, the architecture with filters in both frequency and time reaches an error rate of 16.67% in the semicontrolled environment, while the other architecture reaches 41.67%. The architecture with the lower error rate therefore performs better for word recognition, even with small databases specialized to a particular group of people.
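
The input representation described in the abstract (power spectral density plus its first and second derivatives) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `psd_with_deltas`, the frame length, the hop size, the Hann window, the log scaling, and the choice to take the derivatives along the time axis are all assumptions made for illustration, since the record does not specify them.

```python
import numpy as np


def psd_with_deltas(signal, frame_len=256, hop=128):
    """Stack a log-PSD spectrogram with its first and second time derivatives.

    Hypothetical helper: frame_len, hop, the Hann window and the log scale
    are illustrative assumptions, not values taken from the paper.
    Returns an array of shape (3, n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop

    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # Periodogram estimate of the power spectral density of each frame.
    spectrum = np.fft.rfft(frames, axis=1)
    psd = np.abs(spectrum) ** 2 / np.sum(window ** 2)
    log_psd = 10.0 * np.log10(psd + 1e-10)  # small offset avoids log(0)

    # First and second derivatives along the time (frame) axis (assumed axis).
    delta = np.gradient(log_psd, axis=0)
    delta2 = np.gradient(delta, axis=0)

    # Three channels, analogous to the channels of an image fed to a CNN.
    return np.stack([log_psd, delta, delta2])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    one_second = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz audio
    features = psd_with_deltas(one_second)
    print(features.shape)
```

Stacking the three maps as channels gives a convolutional network a static view of the spectrum together with its rate of change, which is one plausible reading of "strengthening the variety of feature maps".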

Original language	English
Pages (from-to)	6680-6685
Number of pages	6
Journal	Journal of Engineering and Applied Sciences
Volume	13
Issue number	16
DOIs	https://doi.org/10.3923/jeasci.2018.6680.6685
State	Published - 2018
Externally published	Yes

Keywords

CNN architecture
Deep convolutional neural network
Power spectral density
Proposed
Speech recognition

Access to Document

10.3923/jeasci.2018.6680.6685

Cite this

@article{208b08f877224cddbdeb7aa0a015b31a,

title = "Very deep convolutional neural network for speech recognition based on words",

abstract = "This study presents the implementation of two very deep convolutional neural network architectures applied to speech recognition based on complete words (in this case, 12 specific words) in order to evaluate their performance in two types of environments, one semicontrolled and the other non-controlled. One of the architectures uses linear filters only in frequency, while the other uses linear filters in both frequency and time. The power spectral density, together with its first and second derivatives, is proposed as the network input in order to broaden the variety of feature maps that can be used in neural networks for speech recognition. Finally, in tests performed in real time, the architecture with filters in both frequency and time reaches an error rate of 16.67% in the semicontrolled environment, while the other architecture reaches 41.67%. The architecture with the lower error rate therefore performs better for word recognition, even with small databases specialized to a particular group of people.",

keywords = "CNN architecture, Deep convolutional neural network, Power spectral density, Proposed, Speech recognition",

author = "Pinzon, {Javier O.} and Robinson Jimenez-Moreno and Oscar Aviles and Paola Nino and Diana Ovalle",

note = "Publisher Copyright: {\textcopyright} Medwell Journals, 2018.",

year = "2018",

doi = "10.3923/jeasci.2018.6680.6685",

language = "English",

volume = "13",

pages = "6680--6685",

journal = "Journal of Engineering and Applied Sciences",

issn = "1816-949X",

publisher = "Medwell Journals",

number = "16",

}

TY - JOUR

T1 - Very deep convolutional neural network for speech recognition based on words

AU - Pinzon, Javier O.

AU - Jimenez-Moreno, Robinson

AU - Aviles, Oscar

AU - Nino, Paola

AU - Ovalle, Diana

N1 - Publisher Copyright: © Medwell Journals, 2018.

PY - 2018

Y1 - 2018

AB - This study presents the implementation of two very deep convolutional neural network architectures applied to speech recognition based on complete words (in this case, 12 specific words) in order to evaluate their performance in two types of environments, one semicontrolled and the other non-controlled. One of the architectures uses linear filters only in frequency, while the other uses linear filters in both frequency and time. The power spectral density, together with its first and second derivatives, is proposed as the network input in order to broaden the variety of feature maps that can be used in neural networks for speech recognition. Finally, in tests performed in real time, the architecture with filters in both frequency and time reaches an error rate of 16.67% in the semicontrolled environment, while the other architecture reaches 41.67%. The architecture with the lower error rate therefore performs better for word recognition, even with small databases specialized to a particular group of people.

KW - CNN architecture

KW - Deep convolutional neural network

KW - Power spectral density

KW - Proposed

KW - Speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85054678500&partnerID=8YFLogxK

U2 - 10.3923/jeasci.2018.6680.6685

DO - 10.3923/jeasci.2018.6680.6685

M3 - Article

AN - SCOPUS:85054678500

SN - 1816-949X

VL - 13

SP - 6680

EP - 6685

JO - Journal of Engineering and Applied Sciences

JF - Journal of Engineering and Applied Sciences

IS - 16

ER -
