Conclusion
This survey explored speaker recognition in depth, beginning with the types of biometrics and their characteristics. The difference between speaker recognition and speech recognition was highlighted to prevent the two terms being used interchangeably. Speaker recognition and its real-world applications were explained, and a summary of the chronology of advancements in speaker recognition over the past decade was presented. The technology of speech signal processing was described, and variability, insufficient data, and background noise were identified as the main obstacles to building robust speaker recognition systems.
Table 6
Progress in speaker recognition in the last decade.

| Author/Year | Feature extraction | Method | Database (Language) | Population (no. of speakers) | Accuracy (%) |
|---|---|---|---|---|---|
| [49]/2010 | DWT (Daubechies, Symlets, Coiflets) | MLP | Self-generated - 2 Czech speaker corpora (Czech language) | Corpus 1: 10; Corpus 2: 50 | SYML20 with MLP = 98% IR |
| [50]/2011 | MFCCs | FMMNN | Self-generated (Marathi language) | 50 | 99.9% IR |
| [51]/2011 | MFCCs | GMM-UBM | TIMIT (English language) | 100 | 80% for both limited and noisy data |
| [52]/2011 | 13 MFCCs + 13Δ + 13ΔΔ | CHMM | Self-generated (Arabic language) | 10 | 100% for text-dependent, 80% for text-independent |
| [53]/2012 | 13 MFCCs + 13Δ + 13ΔΔ | FCM + FSVM | KING Speech (English language) | 51 | 98.76% IR |
| [54]/2013 | MFCCs | HMM + GFM (fusion) | VoxForge, NIST 2003 (English language) | 100, 140 | 93% IR |
| [55]/2014 | MFCCs | LFA-SVM, linear kernel | TIMIT (English language) | 38 | 82.84% IR |
| [56]/2017 | MFCCs | NN | ELSDSR (English language) | 22 | 93.2% |
| [57]/2019 | MFCCs | CNN | Self-generated (English, Hindi & Marathi languages), SRL82 (Chinese language) | 50 | CNN = 71% IR for SRL82; CNN = 75% IR for real-world voice samples |
| [58]/2020 | MFCCs | AFEASI | LibriSpeech (English language) | 251 | AFEASI = 0.95 accuracy |
| [59]/2013 | TTESBCC | GMM | Self-generated (Marathi language) | 25 | Neutral speech = 98.6% IR; whisper speech = 55.8% IR |
| [60]/2015 | NDSF | Not mentioned | Self-generated (English language), MVSR-IITG (Hindi language) | 100 | 98%-100% IR |
| [61]/2018 | Short-term magnitude spectrogram | CNN (ResNet) | VoxCeleb2, VoxCeleb1 (multiple languages) | 6000, 1251 | ResNet-50 = 3.95% EER |
| [62]/2019 | x-vectors | E-TDNN | NIST SRE18: CMN2 (Arabic language), VAST | ~4500, ~7000 | E-TDNN = 4.95% EER; E-TDNN = 11.1% EER |
| [63]/2019 | x-vectors | DNN (VOCALISE) | Mobile recordings (GBR-ENG), landline recordings (English language) | 534, 387 | x-vector = 1.40% EER |
| [64]/2012 | MFCCs + phase information | GMM | NTT, JNAS (Japanese language) | 35, 270 | 98.8% IR |
| [65]/2015 | i-vector + d-vector | PLDA, DNN + DTW | Self-generated (Chinese language) | 100 | ~2% EER |
| [66]/2016 | MFCCs + LPCs | Pearson's correlation | Self-generated (not mentioned) | 10 | WRA = 100%, 93.33% IR |
| [67]/2018 | LDA + MFCCs | GPLDA | TIMIT_8 (English language) | 130 | Uncoded speech = 0.91% EER; synthesized speech = 12.5% EER |
| [68]/2020 | MFCCT (MFCC + time-based) | DNN | LibriSpeech (English language) | 100 | 92.9% |
| [69]/2020 | Multitaper (MFCC + PNCC) | ELM | TIMIT | 124 | 97.52% for clean speech; 86.70% for AWGN noise; 85.70% for babble noise; 85.96% for street noise |
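Several of the studies above report the equal error rate (EER). As a minimal sketch of my own (not taken from any cited paper), the EER can be computed from verification scores by sweeping a decision threshold until the false-accept and false-reject rates cross; the scores below are made-up illustrative values.

```python
import numpy as np

def eer(genuine, impostor):
    """EER from genuine-trial and impostor-trial scores (higher = more similar)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = (1.0, 0.0)
    for t in thresholds:
        far = np.mean(impostor >= t)   # impostors wrongly accepted
        frr = np.mean(genuine < t)     # genuine speakers wrongly rejected
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2     # EER: operating point where FAR == FRR

# Illustrative scores only
genuine = np.array([0.9, 0.8, 0.75, 0.6, 0.85])
impostor = np.array([0.3, 0.4, 0.55, 0.2, 0.65])
print(f"EER = {eer(genuine, impostor):.2%}")  # prints: EER = 20.00%
```

Production evaluation toolkits interpolate between thresholds for a smoother estimate, but the crossing-point idea is the same.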
Adversarial attacks were also discussed, as they have become a serious issue for machine learning and deep learning systems. The structure of speaker recognition systems and the choice of classifiers were explained thoroughly.
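The adversarial-attack threat model can be illustrated with a toy gradient-sign (FGSM-style) perturbation against a linear verification scorer. Everything here is a made-up sketch for intuition — real attacks such as those in [44]–[46] target deep models end-to-end — but it shows how a small, structured perturbation shifts a match score.

```python
import numpy as np

# Toy gradient-sign perturbation against a linear speaker-verification scorer.
# Weights and dimensions are illustrative stand-ins, not any cited system.
rng = np.random.default_rng(0)
w = rng.normal(size=16)            # "trained" scorer weights
x = rng.normal(size=16)            # clean feature vector (e.g. an embedding)

def score(v):
    """Higher score = stronger match with the claimed speaker."""
    return float(w @ v)

# For a linear model the gradient of the score w.r.t. the input is just w.
# Stepping against that gradient makes the system falsely reject the speaker.
eps = 0.1                          # perturbation budget (small, hard to notice)
x_adv = x - eps * np.sign(w)

assert score(x_adv) < score(x)     # the small perturbation lowers the score
```

The same sign-of-gradient step applied to raw audio samples, with the gradient obtained by backpropagation through a deep network, is the core of many practical audio attacks.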
There is currently great demand for technologies that integrate biometric systems, given their wide range of applications, especially where identification of the individual is needed. Following years of research and development, devices that use voice as the primary mode of interaction, such as Google Home, Amazon Echo (Alexa), Apple's Siri, and Samsung's Bixby, are now widely available. Most are used in households where multiple people are expected to interact with the device. Such tools and technologies are gaining popularity not only for home use but also for making interaction between humans and humanoid robots more realistic. Despite the vast amount of work in this area, significant challenges remain in achieving highly accurate systems for practical scenarios.
Declaration of Competing Interest
None.
Acknowledgment
The authors would like to thank Universiti Tun Hussein Onn Malaysia (UTHM) for funding this research.
References
Biometrics: Authentication & Identification - 2020 Review. (2019). Retrieved from https://www.thalesgroup.com/en/markets/digital-identity-and-security/government/inspired/biometrics.
Zheng TF, Li L. Robustness-related issues in speaker recognition. Springer; 2017.
De Luis-García R, Alberola-López C, Aghzout O, Ruiz-Alzola J. Biometric identification systems. Signal Process 2003;83:2539-57. https://doi.org/10.1016/j.sigpro.2003.08.001.
Singh N. A study on speech and speaker recognition technology and its challenges. In: Proceedings of national conference on information security challenges; 2014. p. 34-6.
Sharma, A.M. (2019). Speaker recognition using machine learning techniques. (Master's Projects). Retrieved from https://scholarworks.sjsu.edu/etd_projects/685.
Zhang Z. Mechanics of Human Voice Production and Control. J Acoust Soc Am 2016;140(4):2614. https://doi.org/10.1121/1.4964509.
Furui S. 40 years of progress in automatic speaker recognition. In: Tistarelli M, Nixon MS, editors. Advances in biometrics. ICB 2009. Lecture notes in computer science, 5558. Berlin, Heidelberg: Springer; 2009.
Sujiya S, Chandra E. A review on speaker recognition. Int J Eng Technol (IJET) 2017;9(3):1592-8.
Imam SA, Bansal P, Singh V. Review: speaker recognition using automated systems. AGU Int J Eng Technol (AGUIJET) 2017;5:31-9.
Keshet J, Bengio S. Introduction. In: Keshet J, Bengio S, editors. Automatic speech and speaker recognition: large margin and kernel methods. West Sussex, United Kingdom: John Wiley & Sons Ltd; 2009.
Kikel, C. (2019, November 11). Difference between voice recognition and speech recognition. Retrieved from https://www.totalvoicetech.com/difference-between-voice-recognition-and-speech-recognition/.
Hanifa RM, Isa K, Mohamad S. Malay speech recognition for different ethnic speakers: an exploratory study. In: 2017 IEEE symposium on computer applications & industrial electronics (ISCAIE); 2017. p. 91-6.
Biometric Today. (2018). Retrieved from https://biometrictoday.com/5-differences-between-voice-and-speech-recognition/.
Sharma V, Bansal PK. A review on speaker recognition approaches and challenges. Int J Eng Res Technol 2013;2(5):1580-8.
Kaphungkui NK, Kandali AB. Text dependent speaker recognition with back propagation neural network. Int J Eng Adv Technol (IJEAT) 2019;8(5):1431-4.
Suchitha TR, Bindu AT. Feature extraction using MFCC and classification using GMM. Int J Sci Res Dev (IJSRD) 2015;3(5):1278-83.
Singh N, Agrawal A, Ahmad Khan R. A critical review on automatic speaker recognition. Sci J Circ, Syst Signal Process 2015;4(2):14-7.
Ibrahim, Y.A., Odiketa, J.C. & Ibiyemi, T.S. (2017). Preprocessing technique in automatic speech recognition for human computer interaction: an overview. Retrieved from https://anale-informatica.tibiscus.ro/download/lucrari/15-1-23-Ibrahim.pdf.
Cutajar M, Gatt E, Grech I, Casha O, Micallef J. Comparative study of automatic speech recognition techniques. IET Signal Proc 2013;7(1):25-46.
Malik S, Afsar FA. Wavelet transform based automatic speaker recognition. In: IEEE 13th international multitopic conference; 2009. p. 1-4.
Singh N, Khan RA, Shree R. Applications of speaker recognition. Procedia Eng 2012;38(2012):3122-6. https://doi.org/10.1016/j.proeng.2012.06.363. ISSN 1877-7058.
Zheng T, Li L. Robustness-related issues in speaker recognition. Springer; 2017.
Gupta H, Gupta D. LPC and LPCC method of feature extraction in speech recognition system. In: 2016 6th international conference - cloud system and big data engineering (Confluence), Noida, 2016; 2016. p. 498-502. https://doi.org/10.1109/CONFLUENCE.2016.7508171.
Kaur K, Jain N. Feature extraction and classification for automatic speaker recognition system - a review. Int J Adv Res Comput Sci Softw Eng 2015;5(1):1-6.
Jamal N, Shanta S, Mahmud F, Sha'abani MNAH. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: a review. In: AIP conference proceedings. 1883; 2017.
Kumar P, Chandra M. Hybrid of wavelet and MFCC features for speaker verification. In: IEEE world congress on information and communication technologies (WICT); 2011. p. 1150-4.
Mohammed, R.A., Ali, A.E. & Hassan, N.F. (2019). Journal of Al-Qadisiyah for computer science and mathematics, 11(3), pp. 21-30.
Janse PV, Magre SB, Kurzekar PK, Deshmukh RR. A comparative study between MFCC and DWT feature extraction technique. Int J Eng Res Technol (IJERT) 2014;3(1):3124-7.
Gomar MG. System and method for speaker recognition on mobile devices. Google Patents; 2015.
Nematollahi MA, Al-Haddad SAR. Distant speaker recognition: an overview. Int J Humanoid Rob 2015;12:1-45.
Rosenberg, et al. L16: speaker recognition. Lect Slides 2007. Retrieved from, http://research.cs.tamu.edu/prism/lectures/sp/l16.pdf.
Gish H, Schmidt M. Text-independent speaker identification. Signal Process Mag, IEEE 1994;11:18-32.
Gaikward SK, Gawali BW, Yannawar P. A review on speech recognition technique. Int J Comput Appl 2010;10(3):16-24.
Sturim, D.E., Campbell, W.M. & Reynolds, D.A. (2007). Classification methods for speaker recognition. In: Proceedings of Speaker Classification I: Fundamentals, Features, and Methods, pp. 278-97. https://doi.org/10.1007/978-3-540-74200-5_16.
Li Z, Li L. Robustness-related issues in speaker recognition. Springer Briefs Signal Process 2017:39-48.
Muda L, Begam M, Elamvazuthi L. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. J Comput 2010;2(3):138-43.
Tolba H. A high-performance text-independent speaker identification of Arabic speakers using a CHMM-based approach. Alexandria Eng J 2011;50:43-7. https://doi.org/10.1016/j.aej.2011.01.007.
Nakagawa S, Wang L, Ohtsuka S. Speaker identification and verification by combining MFCC and phase information. IEEE Trans Audio, Speech, Lang Process 2012;20:1085-95.
Sandouk, U. (2012). Speaker recognition: speaker diarization & identification. (Master's Thesis). Retrieved from https://studentnet.cs.manchester.ac.uk/resources/library/thesis_abstracts/ProjProgReptsMSc12/Sandouk-Ubai-ProgressReport.pdf.
Doddington GR. Speaker recognition - identifying people by their voices. Proc IEEE 1985;73(11):1651-65.
Kotti M, Moschou V, Kotropoulos C. Speaker segmentation and clustering. Signal Process 2007;88(5):1091-124.
Haohui (2019). Adversarial attacks in machine learning and how to defend against them. Notes from the keynote speech by Professor Ling Liu at the 2019 IEEE big data conference. Retrieved from: https://towardsdatascience.com/adversarial-attacks-in-machine-learning-and-how-to-defend-against-them-a2beed95f49c.
Carlini N, Wagner D. Audio adversarial examples: targeted attacks on speech-to-text. In: 2018 IEEE, symposium on security and privacy workshops; 2018.
Yuan X, Chen Y, Zhao Y, Long Y, Liu X, Chen K, Zhang S, Huang H, Wang X, Gunter CA. CommanderSong: a systematic approach for practical adversarial voice recognition. In: Proceedings of the 27th USENIX security symposium, August 15-17; 2018. p. 49-64. ISBN: 978-1-939133-04-5.
Li Z, Shi C, Xie Y, Liu J, Yuan B, Chen Y. Practical adversarial attacks against speaker recognition system. In: Proceedings of the 21st international workshop on mobile computing systems and applications (HotMobile ’20), March 3-4; 2020. https://doi.org/10.1145/3376897.3377856. Austin, TX, USA. ACM, New York, NY, USA, 6 pages.
Singh N, Agrawal A, Khan RA. The development of speaker recognition technology. Int J Adv Res Eng Technol (IJARET) 2018;9(3):8-16.
Sharma A, Singla SK. State-of-the-art modeling techniques in speaker recognition. Int J Electron Eng 2017;9(2):186-95.
Shaver CD, Acken JM. A brief review of speaker recognition technology. Electr Comput Eng Fac Publ Presentat 2016;350. Retrieved from, https://pdxscholar.library.pdx.edu/ece_fac/350.
Kral P. Discrete wavelet transform for automatic speaker recognition. In: Image and signal processing (CISP), 2010 3rd international congress on, 2010; 2010. p. 3514-8.
Jawarkar N, Holambe R, Basu T. Use of fuzzy min-max neural network for speaker identification. In: Recent trends in information technology (ICRTIT), 2011 International Conference on; 2011. p. 178-82.
Krishnamoorthy P, Jayanna HS, Prasanna SRM. Speaker recognition under limited data condition by noise addition. Expert systems with applications, 38. Elsevier. 2011; 2011. p. 13487. https://doi.org/10.1016/j.eswa.2011.04.069.
Tolba H. A high-performance text-independent speaker identification of Arabic speakers using a CHMM-based approach. Alexandria Eng J 2011;50:43-7. https://doi.org/10.1016/j.aej.2011.01.007.
YuJuan X, Hengjie L, Ping T. Hierarchical fuzzy speaker identification based on FCM and FSVM. In: Fuzzy systems and knowledge discovery (FSKD), 2012 9th international conference; 2012. p. 311-5.
Bhardwaj S, Srivastava S, Hanmandlu M, Gupta J. GFM-based methods for speaker identification. IEEE Trans Cybern 2013;43:1047-58.
Shen X, Zhai Y, Wang Y, Chen H. A speaker recognition algorithm based on factor analysis. In: 2014 7th international congress on image and signal processing; 2014. p. 897-901.
Soleymanpour M, Marvi H. Text-independent speaker identification based on selection of the most similar feature vectors. Int J Speech Technol 2017;20:99-108.
Jagiasi R, Ghosalkar S, Kulal P, Bharambe A. CNN based speaker recognition in language and text-independent small scale system. In: Proceedings of 3rd international conference on IoT in social, mobile, analytics and cloud (I-SMAC); 2019. p. 176-9.
Li R, Jiang J-Y, Liu J, Hsieh C-C, Wang W. Automatic speaker recognition with limited data. In: the 13th ACM international conference on web search and data mining (WSDM ’20); 2020.
Jawarkar NP, Holambe RS, Basu TK. Speaker identification using whispered speech. In: Communication systems and network technologies (CSNT), 2013 international conference on, 2013; 2013. p. 778-81.
Chougule SV, Chavan MS. Robust spectral features for automatic speaker recognition in mismatch condition. Procedia Comput Sci 2015;58:272-9. https://doi.org/10.1016/j.procs.2015.08.021.
Chung, J.S., Nagrani, A. & Zisserman, A. (2018). VoxCeleb2: deep speaker recognition. arXiv:1806.05622.
Villalba J, Chen N, Snyder D, Garcia-Romero D, McCree A, Sell G, Borgstrom J, Richardson F, Shon S, Grondin F, Dehak R, Garcia-Perera LP, Povey D, Torres-Carrasquillo PA, Khudanpur S, Dehak N. State-of-the-art speaker recognition for telephone and video speech: the JHU-MIT submission for NIST SRE18. Proc Interspeech 2019:1488-92. https://doi.org/10.21437/Interspeech.2019-2713.
Kelly F, Forth O, Kent S, Gerlach L, Alexander A. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors. In: AES international conference on audio forensics; 2019.
Nakagawa S, Wang L, Ohtsuka S. Speaker identification and verification by combining MFCC and phase information. IEEE Trans Audio, Speech, Lang Process 2012;20:1085-95.
Li L, Lin Y, Zhang Z, Wang D. Improved deep speaker feature learning for text-dependent speaker recognition. In: APSIPA annual summit and conference; 2015. p. 426-9.
Desai N, Tahilramani N. Digital speech watermarking for authenticity of speaker in speaker recognition system. In: International conference on microelectronics and telecommunication engineering; 2016. p. 105-9.
Zergat KY, Selouani SA, Amrouche A. Feature selection applied to G.729 synthesized speech for automatic speaker recognition. In: IEEE 5th international congress on information science & technology (CIST); 2018. p. 178-82.
Jahangir R, Teh YW, Memon NA, Mujtaba G, Zareei M, Ishtiaq U, Akhtar MZ, Ali I. Text-independent speaker identification through feature fusion and deep neural network. IEEE Access 2020;8:32187-202. https://doi.org/10.1109/ACCESS.2020.2973541.
Bharat KP, Rajesh KM. ELM speaker identification for limited dataset using multitaper based MFCC and PNCC features with fusion score. Multimed Tools Appl 2020;79:28859-83. https://doi.org/10.1007/s11042-020-09353-z.
Rafizah Mohd Hanifa obtained her bachelor’s degree in computer science from Universiti Sains Malaysia (USM) in 1999, followed by a master’s degree in Information Technology from Universiti Utara Malaysia (UUM) in 2001. She is currently a Ph.D. student at Universiti Tun Hussein Onn Malaysia (UTHM). Her research interests include speech processing, artificial intelligence, and augmented reality.
Khalid Isa graduated from Universiti Teknologi Malaysia in 2001 with a BSc in Computer Science. He pursued his MSc in Computer Systems Engineering and Communications at Universiti Putra Malaysia, graduating in 2005. In 2014, he completed his Ph.D. degree in Electrical and Electronic Engineering at Universiti Sains Malaysia, specializing in Computational Intelligence and Underwater Robotics.
Shamsul Mohamad obtained his BSc and MSc in Computer Science from Universiti Teknologi Malaysia in 1999 and Universiti Sains Malaysia in 2004, respectively. He completed his Ph.D. degree in Computer Science at Universiti Teknologi Malaysia, with a specialization in Crowd Simulation. His research interests include crowd simulation, artificial intelligence, and the Internet of Things.