Research on speaker recognition first started in the 1930s. In March 1932, the kidnapping and killing of Charles and Anne Lindbergh's baby boy prompted research into speakers' speech signals. During the suspected kidnapper's trial, Charles Lindbergh claimed that the voice of the kidnapper, Bruno Hauptmann, was the same as the voice he had heard while waiting in a car near the place where the ransom was paid [46]. Frances McGehee, inspired by the case, conducted the first academic research on the reliability of earwitnesses in 1937, a topic that later became of interest in forensics and psychology research [46]. Research in this area continues to this day. The active research of the past few decades has been driven by the many possible choices of feature selection or extraction, modeling techniques, classification, and decision making, as well as the different databases used [47,48]. In the following paragraphs, the research progress of the past decade is discussed in detail, organized by feature extraction technique.
Discrete wavelet transform (DWT)
Kral in [49] studied parameterization/classification methods for the Czech language. Two Czech speaker corpora were used in the research. The first corpus contained the speech of 10 native speakers recorded under laboratory conditions to eliminate undesired effects (e.g., background noise, speaker overlapping). The second corpus consisted of speech recorded by 50 Czech native speakers in a less clean environment, i.e., with some low-level stationary background noise, and was created to build a dialog system for Czech Railways. Three wavelet families with different numbers of coefficients (Daubechies with eight coefficients, Symlets with 14 coefficients, and Coiflets with three coefficients) were used and evaluated. The Gaussian Mixture Model (GMM) and the Multi-Layer Perceptron (MLP) were used as classifiers for comparison purposes. The results revealed that the best recognition of 99% was achieved with the combination of Linear Prediction Cepstral Coefficients (LPCEPSTRA) and a GMM classifier, while the best wavelet configuration, SYML20 with an MLP classifier, gave an identification rate (IR) of 98%. The research also showed that the MLP classifier could reduce training time to only 30 s, compared with the GMM, which needed at least one minute. Although the accuracy was high, the researcher used a closed set of speakers, meaning the test voice always came from a group of known speakers, which is not practical.
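As a rough illustration of DWT-based speech parameterization, the sketch below computes log subband energies for a single speech frame. It assumes the PyWavelets library, and the wavelet names ('db8', 'sym14', 'coif3') are illustrative stand-ins that may not match the exact filters used in [49].

```python
# Sketch: DWT-based speech parameterization (illustrative, not the exact setup of [49]).
import numpy as np
import pywt

def dwt_subband_features(frame, wavelet="db8", level=4):
    """Decompose one speech frame with a DWT and return log subband energies."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)  # [cA_L, cD_L, ..., cD_1]
    # Log energy of each subband serves as a simple wavelet feature vector.
    return np.array([np.log(np.sum(c ** 2) + 1e-10) for c in coeffs])

# Example: one 25 ms frame at 16 kHz, compared across three wavelet families.
frame = np.random.randn(400)
for w in ("db8", "sym14", "coif3"):  # assumed mappings of the wavelets in [49]
    print(w, dwt_subband_features(frame, wavelet=w))
```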
Mel frequency cepstral coefficients (MFCC)
The research conducted by [50] considered the effectiveness of the fuzzy min-max neural network (FMMNN) as the classifier for closed-set text-independent speaker identification. Since Mel-Frequency Cepstral Coefficients (MFCCs) are commonly used features in speaker recognition, the researchers used them as the features in their work. Words, digits, and sentences uttered in the Marathi language by 50 speakers formed the database in their experiment. The researchers compared the performance of two classifiers, FMMNN and GMM. The results revealed an accuracy of up to 99.99% with 15 s of speech when using the FMMNN, while the GMM only achieved 97.65%. As this research also used a closed-set system, its practical applicability was limited.
The work undertaken by [51] showed that speaker recognition performance could be improved by controlling noise. Their study was conducted using the TIMIT database, with 100 speakers randomly selected. MFCCs were used as the feature vectors, and the Gaussian Mixture Model-Universal Background Model (GMM-UBM) was used as the classifier. The first experiment evaluated the speaker recognition system's performance under limited data conditions. In the second experiment, the researchers created noisy speech by adding white Gaussian noise to the clean speech. Their work showed that the system's performance under limited data conditions could be raised to 80% by artificially increasing the number of feature vectors through added noise, whereas the performance was only 78.20% for limited clean data. They attributed the difference to the relative increase in the number of feature vectors.
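The augmentation idea can be sketched as follows. This is a minimal illustration assuming the librosa and scikit-learn libraries, a 20 dB signal-to-noise ratio, and a hypothetical file name, not the exact setup of [51].

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def add_white_noise(signal, snr_db=20.0):
    """Return a noisy copy of the signal at the requested signal-to-noise ratio."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def mfcc_frames(signal, sr=16000, n_mfcc=13):
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T  # (frames, 13)

# Augment limited clean data: pool clean and noisy feature vectors for one speaker.
clean, sr = librosa.load("speaker01.wav", sr=16000)  # hypothetical file
features = np.vstack([mfcc_frames(clean, sr),
                      mfcc_frames(add_white_noise(clean), sr)])
speaker_gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(features)
```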
The work in [52] used the Continuous Hidden Markov Model (CHMM) to identify Arabic speakers automatically from their voices. Ten Arabic speakers formed the database used to evaluate the proposed CHMM-based engine. The CHMM uses the Gaussian density function, the most efficient density function, without loss of generality. Furthermore, the CHMM is more accurate than the Discrete Hidden Markov Model (DHMM), as it builds the model directly from continuous observations without quantization. MFCCs were used as the feature vectors, and the results showed an IR of 100% for text-dependent operation and 80% for text-independent operation. Text-independent operation is the more practical mode, as a speaker can be identified from any speech utterance.
On the other hand, [53] proposed a novel hierarchical fuzzy speaker identification method based on fuzzy c-means (FCM) clustering and a fuzzy support vector machine (FSVM). FCM was used to reduce the amount of training audio data and the computational complexity, while the FSVM then processed the unclassifiable data to make the final decision. For evaluation, the KING speech database was used; 13-dimensional MFCCs together with their first and second derivatives were combined into 39-dimensional input feature vectors. Two experiments were conducted for comparison purposes, with SVM and FSVM as classifiers, and the results indicated that combining FCM and FSVM increased the IR from 94.53% to 98.76% compared to using FSVM alone.
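A minimal numpy sketch of FCM used to compress a large set of training vectors into cluster centers before SVM training is given below; the fuzzifier m = 2 and the cluster count are assumptions, not parameters reported in [53].

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=32, m=2.0, n_iter=100, eps=1e-9):
    """Return FCM cluster centers of X with shape (n_samples, n_features)."""
    n = X.shape[0]
    U = np.random.dirichlet(np.ones(n_clusters), size=n)  # membership matrix
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / (Um.sum(axis=0)[:, None] + eps)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        U = 1.0 / (dist ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)  # memberships of each sample sum to 1
    return centers

# The reduced centers (rather than every frame) can then be fed to an SVM/FSVM.
```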
The study in [54] presented three approaches that use the generalized fuzzy model (GFM) in different roles. The first model combined the Hidden Markov Model (HMM) and the GFM; the second combined the GMM and the GFM; and the last combined the HMM and the GFM with fusion. These three models were tested on the VoxForge speech corpus under clean and noisy conditions and on a benchmark database from the National Institute of Standards & Technology 2003 (NIST 2003). MFCCs were used for feature extraction. The results showed that the HMM and GFM model with fusion gave a 93% IR, followed by the other two models with 92% IR.
The work in [55] utilized latent factor analysis (LFA) to deal with channel interference in the speaker's GMM. The algorithm used a factor analysis technique to model the differences between the speaker characteristic space and the channel space and removed the channel factor from the speaker's GMM. Based on a selection of 38 speakers from the TIMIT speech database and using MFCCs as the feature vectors, the researchers tested the performance of different system modes: Gaussian Mixture Model-Support Vector Machine (GMM-SVM) with a linear kernel, Latent Factor Analysis-Support Vector Machine (LFA-SVM) with a linear kernel, GMM-SVM with a Gaussian kernel, LFA-SVM with a Gaussian kernel, and GMM-UBM. They found that the LFA-SVM with a linear kernel gave the highest accuracy (82.84%) among the system modes.
The study in [56] proposed two methods for finding the MFCC feature vectors with the highest similarity, to be applied to a text-independent speaker identification system. Their experiment used 22 speakers selected from the English Language Speech Database for Speaker Recognition (ELSDSR) and utilized a Neural Network (NN) as the classifier. The first method's recognition accuracy was 91.9%, while the second method achieved 93.2%, and the running time was reduced to 42.03% and 20% for the first and second methods, respectively. They claimed that their two approaches could be used for large-scale databases.
The research by [57] explored a text-independent system using MFCCs with DNN and CNN classifiers, adopted for comparison purposes. A DNN can provide better noise immunity than a GMM, while a CNN is better at identifying patterns in the data and scales better. Although there was no clear indication of the type of database used in their study, they mentioned recording speakers who used multiple languages (English, Hindi, and Marathi), and they also used a speaker recognition dataset from openslr developed by Tsinghua University. The results for 50 speakers taken from openslr showed that the CNN gave better accuracy, 71%, compared to the DNN with only 61%. For the real-world voice samples with eight speakers, the CNN again gave a higher accuracy of 75% compared to the DNN with only 58%. Thus, as a learning model, the CNN excelled at identifying patterns in the input and scaled much better than the DNN.
Neural network-based ASR has shown remarkable power in achieving excellent recognition when enough training data is available; unfortunately, a lack of training data prevents ASR systems from performing accurately. Thus, [58] proposed an adversarial few-shot learning-based speaker identification framework (AFEASI) to develop robust speaker identification models from a limited number of training instances. Besides employing metric learning-based few-shot learning, they applied adversarial learning to enhance the robustness of speaker identification. Eleven methods were compared, including seven CNN variants, a prototypical network (PN), SincNet (SC), and AFEASI. Accuracy was calculated as the number of correctly identified test instances divided by the total number of test instances. Among the methods, AFEASI achieved the highest accuracy of about 0.95 at the setting of 60 s per speaker.
Temporal Teager energy based subband cepstral coefficients (TTESBCC)
Unlike [50], the researchers in [59] studied whispered speech using a new feature called temporal Teager energy based subband cepstral coefficients (TTESBCCs) for closed-set text-independent speaker identification, rather than using the FMMNN as the classifier. The TTESBCCs were compared with three other feature sets: MFCCs, temporal energy of subband cepstral coefficients (TESBCCs), and weighted instantaneous frequency (WIF). Using a self-generated database of speech uttered in the Marathi language by 25 speakers and employing a GMM as the speaker model, the researchers achieved a higher accuracy with the TTESBCCs: the IR was 98.6% for neutral speech and 55.8% for whispered speech, both higher than the rates obtained with MFCCs, TESBCCs, and WIF.
Normalized dynamic spectral features (NDSFs)
The research work in [60] used a robust spectral feature set called Normalized Dynamic Spectral Features (NDSFs) under mismatched conditions. Their models were built from three different feature sets: Linear Prediction Cepstral Coefficients (LPCCs), MFCCs, and NDSFs. They used two different databases in their investigation. The first consisted of 100 speakers uttering English speech recorded with various sensors, while the second was the multi-variability speaker recognition (MVSR) corpus of continuous Hindi speech from the Indian Institute of Technology Guwahati (IITG) database. These two databases allowed them to investigate the mismatch effect. They showed that the proposed feature set (NDSFs) was more robust than cepstral features such as MFCCs and LPCCs, with 98% to 100% IR.
Short-term magnitude spectrograms
The research work in [61] introduced a deep Convolutional Neural Network (CNN)-based speaker-embedding system, called VGGVox, trained to map voice spectrograms to a compact Euclidean space in which distance directly corresponds to a measure of speaker similarity. The VoxCeleb2 dataset was used for training and validation, while the VoxCeleb1 dataset was used for testing. Two trunk architectures were used in their work: VGG-M and Residual Network (ResNet). ResNet-50 trained on VoxCeleb2 gave an EER of 3.95% and a cost function value of 0.429. For benchmarking, VoxCeleb1-E (the entire set) and VoxCeleb1-H (trials within the same nationality and gender) were used. The results showed that ResNet-50 tested on VoxCeleb1-E had an EER of 4.42%, while ResNet-50 tested on VoxCeleb1-H had an EER of 7.33%.
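For illustration, a short-term magnitude spectrogram of the kind used as CNN input can be computed as sketched below; the 25 ms window, 10 ms hop, per-bin normalization, and file name are assumptions, not the exact VGGVox recipe.

```python
import numpy as np
import librosa

# Sketch: short-term magnitude spectrogram as input to a CNN embedding network.
signal, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
stft = librosa.stft(signal, n_fft=512, win_length=400, hop_length=160)
magnitude = np.abs(stft)                               # shape (257, n_frames)
log_spec = np.log(magnitude + 1e-6)
# Mean/variance normalization per frequency bin before feeding the CNN.
log_spec = (log_spec - log_spec.mean(axis=1, keepdims=True)) / (
    log_spec.std(axis=1, keepdims=True) + 1e-6)
```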
X-vectors
The research conducted by [62] used the National Institute of Standards and Technology Speaker Recognition Evaluation 2018 (NIST SRE18) in their work, consisting of telephone speech from the Call My Net 2 (CMN2) corpus and videos extracted from the VAST corpus. They explored Time Delay Neural Networks (TDNN), Extended TDNN (E-TDNN), Factorized TDNN (F-TDNN), and a Residual Network with 34 layers (ResNet34). The results showed that E-TDNN x-vectors formed the best single system, with an EER of 11.1% on VAST and 4.95% on CMN2.
Voice Comparison and Analysis of the Likelihood of Speech Evidence (VOCALIZE) is an ASR system that lets the user flexibly conduct speaker comparisons using various features and algorithms. The study in [63] presented a new DNN-based version of VOCALIZE using x-vectors, which offered a choice of feature extraction, speaker modeling, and speaker comparison approaches and allowed speech recordings to be introduced at various training stages. In their work, they used landline and mobile recordings (GBR-ENG) for comparison purposes. The results revealed that the adapted EER for the x-vector on mobile recordings was 1.40%, while the adapted EER for the i-vector was 5.80%. The x-vectors outperformed the i-vectors, especially for short durations.
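Since several of the studies above report EER, the sketch below shows one common way to estimate it from verification trial scores using scikit-learn's ROC utilities; the labels and scores are toy values for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """Estimate the EER from trial scores (label 1 = same speaker, 0 = impostor)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # operating point where FAR ~= FRR
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy example with made-up similarity scores.
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.55, 0.4, 0.3, 0.6, 0.7, 0.2])
print(f"EER = {equal_error_rate(labels, scores):.2%}")
```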
Features combination
The study by [64] showed that using phase information, which is often ignored in conventional MFCC-based speaker recognition, could improve the speaker IR from 97.4% to 98.8%. A GMM was used to model each speaker, and two databases were used to evaluate the method: the NTT database and the Japanese Newspaper Article Sentences (JNAS) database.
The research conducted by [65] proposed several enhancements for deep neural network (DNN)-based feature learning. They presented a phone-dependent DNN model to supply phonetic information when learning speaker features and proposed two scoring methods: segment pooling and dynamic time warping (DTW). Two kinds of speaker vectors were used in their research: i-vectors, based on a Gaussian model, and d-vectors, based on neural networks. Their database comprised 100 speakers who uttered ten short phrases containing 2-6 Chinese characters. Combining the best i-vector system, which used probabilistic linear discriminant analysis (PLDA), with the best d-vector strategy (DNN+DTW) gave an Equal Error Rate (EER) of about 2%, the best performance obtained.
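A minimal sketch of DTW used as a scoring method between two sequences of frame-level speaker vectors is given below, in the spirit of the DNN+DTW strategy; the Euclidean local distance and the length normalization are assumptions rather than details taken from [65].

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of frame-level vectors."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)  # length-normalized alignment cost

# A lower cost between enrolment and test d-vector sequences suggests the same speaker.
```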
To prevent unauthorized persons from directly or indirectly attacking the speaker recognition system, [66] proposed an additional security layer based on watermark technology. Their method hid the watermark data by changing the frame size of samples in the unvoiced part of the speech signal so that the hidden data would not be noticed. Both MFCCs and Linear Prediction Coding (LPC) were used as features, and Pearson's correlation was used as the classifier to decide whether a speaker was recognized (authorized) or not (unauthorized). A database of 10 speakers uttering the same word was recorded. The results showed that 100% security accuracy could be achieved using the watermark technology, along with 93.33% recognition accuracy; hence, watermarking could be applied to make the system more secure.
ASR over wireless communication is known to suffer degradation because the received speech is synthesized. The research work in [67] proposed using Gaussian probabilistic linear discriminant analysis (GPLDA) as the classifier, with feature selection by Linear Discriminant Analysis (LDA) that rejected 19 of the 69 MFCC features to improve accuracy. TIMIT_8, a modified version of the TIMIT corpus, was used with 130 speakers. The results showed that the EER for the uncoded speech was 0.91%, while the EER for the synthesized speech was 12.5%.
The work by [68] proposed a novel fusion of MFCC and time-based features (MFCCT) that was fed as input to a DNN to construct the speaker identification model. A subset of the LibriSpeech dataset consisting of 100 speakers was used. Besides the DNN, five other machine learning classification algorithms (Random Forest, k-Nearest Neighbors, Naive Bayes, J48, and Support Vector Machine) were used for comparison purposes. The results revealed that the DNN outperformed the other five algorithms with an overall accuracy of 92.9%.
The research in [69] aimed to identify speakers under clean and noisy backgrounds using a limited dataset. They proposed multitaper-based MFCC and power normalized cepstral coefficients (PNCC) as feature vectors, which were then normalized using cepstral mean and variance normalization (CMVN) and feature warping (FW). The TIMIT and SITW 2016 databases were used for evaluation. The results indicated that the proposed method provided better accuracy than other state-of-the-art techniques; the accuracy on the TIMIT database was 92.52%, 86.7%, 85.7%, and 85.96% under clean speech, additive white Gaussian noise (AWGN), babble noise, and street noise, respectively.
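Per-utterance CMVN simply standardizes each cepstral coefficient over the utterance; the sketch below is a generic illustration rather than the exact normalization pipeline of [69].

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    `features` is a (n_frames, n_coeffs) matrix of MFCC/PNCC vectors.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)
```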
Table 6 summarizes the progress of research in the area of speaker recognition in the past decade.