Speaker recognition with hybrid features from a deep belief network

Ali, Hazrat; Tran, Son N.; Benetos, Emmanouil; d’Avila Garcez, Artur S.

doi:10.1007/s00521-016-2501-7

Original Article
Published: 17 August 2016

Volume 29, pages 13–19, (2018)
Neural Computing and Applications

Hazrat Ali¹, Son N. Tran², Emmanouil Benetos²,³ & Artur S. d’Avila Garcez²

Abstract

Learning representations from audio data has shown advantages over handcrafted features such as mel-frequency cepstral coefficients (MFCCs) in many audio applications. In most representation learning approaches, connectionist systems have been used to learn and extract latent features from fixed-length data. In this paper, we propose an approach that combines learned features with MFCC features for the speaker recognition task and that can be applied to audio scripts of different lengths. In particular, we study the use of features from different levels of a deep belief network (DBN) for quantizing the audio data into vectors of audio word counts. These fixed-dimensional vectors represent audio scripts of different lengths, which makes it easier to train a classifier. Our experiments show that audio word count vectors generated from a mixture of DBN features at different layers perform better than the MFCC features. We achieve further improvement by combining the audio word count vectors with the MFCC features.
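The quantization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the audio words are obtained or how the MFCCs are merged with the counts, so the nearest-neighbour codebook lookup, the count normalization, and the mean-MFCC summary below are all assumptions, and the names `nearest_word`, `audio_word_counts`, and `hybrid_vector` are hypothetical. The frames stand in for per-frame DBN activations.

```python
from collections import Counter
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(frame: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to `frame` (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))


def audio_word_counts(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map every frame to an audio word and count how often each word occurs.

    The output has len(codebook) entries, however many frames the clip has.
    """
    counts = Counter(nearest_word(f, codebook) for f in frames)
    return [counts.get(k, 0) for k in range(len(codebook))]


def hybrid_vector(counts: Sequence[int], mfcc_frames: Sequence[Vector]) -> List[float]:
    """Concatenate normalized word counts with the per-coefficient mean MFCC.

    Assumption: averaging MFCC frames is only one simple way to obtain a
    fixed-size MFCC summary; the paper may combine the features differently.
    """
    total = sum(counts) or 1
    normalized = [c / total for c in counts]
    n_frames = len(mfcc_frames)
    dim = len(mfcc_frames[0])
    mean_mfcc = [sum(f[d] for f in mfcc_frames) / n_frames for d in range(dim)]
    return normalized + mean_mfcc


# Toy codebook with three audio words (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
short_clip = [[0.1, -0.1], [0.9, 1.2]]
long_clip = [[0.1, 0.0], [0.2, 0.1], [4.8, 5.1], [5.2, 4.9], [1.1, 0.9]]

short_counts = audio_word_counts(short_clip, codebook)  # [1, 1, 0]
long_counts = audio_word_counts(long_clip, codebook)    # [2, 1, 2]
```

Both clips yield a vector with one entry per audio word even though they contain different numbers of frames, which is what lets a standard classifier be trained on recordings of varying length.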

Notes

A useful survey of MFCCs and other features, such as supervectors, for speaker recognition is presented by Kinnunen et al. [2].
Besides the work reviewed in this section, more recent work is reported in [3], which presents a deep neural network approach to the speaker recognition task.
The i-vector is a recently developed feature set for low-dimensional representation of speech data [8] and has attracted the attention of the machine learning community through the NIST i-vector challenge [9, 10].
A useful tutorial on SVM is available from Burges [17].
The dataset can be requested via email.
Previous experiments with this dataset for speech recognition applications are reported in [20, 21].

References

[1] Mohamed AR, Dahl GE, Hinton G (2012) Acoustic modeling using deep belief networks. IEEE Trans Audio Speech Lang Process 20(1):14–22
[2] Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 52(1):12–40
[3] Richardson F, Reynolds D, Dehak N (2015) Deep neural network approaches to speaker and language recognition. IEEE Signal Process Lett 22(10):1671–1675
[4] Lee H, Pham P, Largman Y, Ng AY (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A (eds) Advances in neural information processing systems. NIPS, Abu Dhabi, pp 1096–1104
[5] Garofolo J, Lamel L, Fisher W, Fiscus J, Pallett D, Dahlgren N, Zue V (1993) DARPA TIMIT acoustic phonetic continuous speech corpus cdrom. https://catalog.ldc.upenn.edu/LDC93S1
[6] Senoussaoui M, Dehak N, Kenny P, Dehak R, Dumouchel P (2012) First attempt of Boltzmann machines for speaker verification. In: Odyssey 2012: the speaker and language recognition workshop. ACM, pp 1064–1071
[7] Ghahabi O, Hernando J (2014) Deep belief networks for i-vector based speaker recognition. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 1700–1704
[8] Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798
[9] NIST i-vector Machine Learning Challenge (2014). https://ivectorchallenge.nist.gov/
[10] Ali H, d’Avila Garcez A, Tran S, Zhou X, Iqbal K (2014) Unimodal late fusion for NIST i-vector challenge on speaker detection. Electron Lett 50(15):1098–1100
[11] Molau S, Pitz M, Schluter R, Ney H (2001) Computing mel-frequency cepstral coefficients on the power spectrum. In: Proceedings of 2001 IEEE international conference on acoustics, speech, and signal processing, vol 1, pp 73–76
[12] Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
[13] Le Roux N, Bengio Y (2008) Representational power of restricted Boltzmann machines and deep belief networks. Neural Comput 20(6):1631–1649
[14] Deng L, Yu D (2014) Deep learning: methods and applications. NOW Publishers, Breda
[15] Freund Y, Haussler D (1994) Unsupervised learning of distributions on binary vectors using two layer networks. University of California at Santa Cruz, Santa Cruz, Tech. Rep
[16] Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771–1800
[17] Burges C (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167
[18] Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol, vol 2, pp 27:1–27:27. http://www.csie.ntu.edu.tw/~cjlin/libsvm
[19] Ali H, Ahmad N, Yahya KM, Farooq O (2012) A medium vocabulary Urdu isolated words balanced corpus for automatic speech recognition. In: 2012 international conference on electronics computer technology (ICECT 2012), pp 473–476
[20] Ali H, Ahmad N, Zhou X, Ali M, Manjotho A (2014) Linear discriminant analysis based approach for automatic speech recognition of Urdu isolated words. In: Communication technologies, information security and sustainable development, ser. communications in computer and information science, vol 414. Springer International Publishing, pp 24–34
[21] Ali H, Ahmad N, Zhou X, Iqbal K, Ali SM (2014) DWT features performance analysis for automatic speech recognition of Urdu. SpringerPlus 3(1):204

Acknowledgments

The authors would like to thank Nasir Ahmad (University of Engineering and Technology, Peshawar, Pakistan) and Tillman Weyde (City University London) for their useful feedback during this work.

Hazrat Ali is grateful for funding from the Erasmus Mundus Strong Ties Grant. Emmanouil Benetos was supported by the UK AHRC-funded project 'Digital Music Lab - Analysing Big Music Data', Grant No. AH/L01016X/1, and is supported by a UK RAEng Research Fellowship, Grant No. RF/128. Hazrat Ali and Son N. Tran contributed equally to the paper.

Author information

Authors and Affiliations

Department of Electrical Engineering, COMSATS Institute of Information Technology, University Road, Abbottabad, 22060, Pakistan
Hazrat Ali
Department of Computer Science, City University London, Northampton Square, London, EC1V 0HB, UK
Son N. Tran, Emmanouil Benetos & Artur S. d’Avila Garcez
School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
Emmanouil Benetos

Corresponding author

Correspondence to Hazrat Ali.

About this article

Cite this article

Ali, H., Tran, S.N., Benetos, E. et al. Speaker recognition with hybrid features from a deep belief network. Neural Comput & Applic 29, 13–19 (2018). https://doi.org/10.1007/s00521-016-2501-7

Received: 25 January 2016
Accepted: 20 July 2016
Published: 17 August 2016
Issue Date: March 2018
DOI: https://doi.org/10.1007/s00521-016-2501-7
