The main factor affecting a speaker recognition system's performance is voice variability, also known as session variability, which can be further classified into intra-speaker and inter-speaker variation [22]. Intra-speaker variation arises from factors such as emotion, speaking rate, mode of speech, disease, the speaker's mood, and the emphasis given to a word at a particular moment [42]. Inter-speaker variation, on the other hand, stems from anatomical differences in the vocal organs and learned differences in the speech mechanism [22,42]. According to [7] and [22], the most extreme variation occurs between the training and testing sessions: during training the speaker typically records in a clean environment, whereas during testing the speaker speaks under noisy conditions. Hence, [4] suggested that voice recordings should be conducted at intervals to mitigate the effect of changes in the speaker's voice over time. Furthermore, [2] and [17] highlighted that speech signals may also change because of different transmission channels, such as the different types of microphones and headphones used to record the utterances. These differences leave the model operating under mismatched conditions.
Issues on insufficient data
Insufficient data refers to the lack of enough data to train representative models that can reach an accurate decision. It is a serious and common problem, as most applications require systems that operate with the smallest practical amount of training data, recorded in the fewest possible enrolment sessions [7]. The work in [2] highlighted that the core difficulty is the tendency to assume that the test data distribution matches the distribution represented by the speaker model; this assumption holds only when the speaker model is well trained on long enrolment speech and evaluated on sufficient test speech. In her paper [4], Singh claimed that the system's performance depends on the quantity of training data, whereas according to [14] there is no evidence that longer voice samples provide better results. On the other hand, [2] not only argued that a sufficiently large amount of speech data is necessary for model training but also recognized that, in practice, it is difficult for a system to collect long utterances, or the user simply does not speak for long enough. Consequently, speaker recognition with limited data remains a topic of active investigation.
Issues on background noise
Background noise is another issue highlighted by [22]: during training the speaker often speaks in a clean environment, whereas during testing the speaker speaks under noisy conditions. Background noise is a significant factor affecting the accuracy of speaker recognition; accuracy is high for clean samples and low for noisy samples, with babble noise reported to have little impact [14]. Recorded speech that contains background noise such as white noise or music has the most significant effect on speaker modeling: it disturbs the evaluation test and degrades the performance of the speaker recognition system.
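To make the clean-versus-noisy mismatch concrete, the following Python sketch (an illustration for this review, not taken from the cited works; the function name and the synthetic signals are assumptions) mixes a noise recording into a clean utterance at a chosen signal-to-noise ratio, which is how noisy test conditions are commonly simulated when evaluating speaker recognition systems:

import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the mixture has the requested SNR in dB."""
    # Repeat or truncate the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: create a 5 dB white-noise version of a (synthetic) 1 s, 16 kHz utterance.
rng = np.random.default_rng(0)
clean_utterance = rng.standard_normal(16000)
noisy_utterance = add_noise_at_snr(clean_utterance, rng.standard_normal(16000), snr_db=5.0)

Evaluating the same speaker model on the clean and noisy versions of a test set shows how quickly accuracy degrades as the SNR decreases.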
Adversarial attacks
Machine learning (ML) and deep learning (DL) have yielded impressive advances in many fields. Unfortunately, recent studies have shown that machine learning models can be fooled into giving incorrect predictions.
Fig. 8 illustrates the threat that adversarial attacks pose: an image of a panda is initially classified correctly, but after a small amount of crafted noise is added, the model changes its prediction to another animal, a gibbon, with even higher confidence. Although the original and perturbed images look identical to a human observer, they appear different to the model. The change in prediction is the result of an adversarial attack. Because we cannot perceive the difference, we cannot tell that an adversarial attack has occurred; hence, it is difficult to judge whether the model's output is correct, even though the input has been altered.
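The panda-to-gibbon example is typically generated with the fast gradient sign method (FGSM). The short PyTorch sketch below is an illustration of that idea, assuming a differentiable classifier model that returns logits and images scaled to [0, 1]; it is not necessarily the exact procedure behind Fig. 8:

import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, label, epsilon=0.007):
    # One-step attack: move the input in the direction that increases the loss
    # of the true label, by an amount too small for a human to notice.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

Even though epsilon limits each pixel change to a tiny value, the resulting image can be classified as a completely different class with high confidence.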
Adversarial attacks are categorized into targeted and untargeted attacks [42]. In a targeted attack, the adversary chooses a target class and crafts the perturbation so that the model misclassifies the image as that specific class rather than the original one. Untargeted attacks, in contrast, have no intended target class; the aim is simply to make the model predict the adversarial image as any class other than the original class.
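The difference between the two categories comes down to which loss is followed and in which direction. The sketch below (PyTorch; model, y_true, and y_target are illustrative names introduced here, not taken from [42]) contrasts a single targeted and untargeted gradient-sign step:

import torch
import torch.nn.functional as F

def adversarial_step(model, x, y_true, y_target=None, epsilon=0.01):
    # Targeted if `y_target` is given, untargeted otherwise.
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    if y_target is not None:
        # Targeted: decrease the loss of the attacker-chosen class so the
        # model is pushed to predict `y_target`.
        loss = F.cross_entropy(logits, y_target)
        loss.backward()
        return (x - epsilon * x.grad.sign()).detach()
    # Untargeted: increase the loss of the true class so the model predicts
    # any class other than `y_true`.
    loss = F.cross_entropy(logits, y_true)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()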
The research work in [43] constructed targeted audio adversarial examples against automatic speech recognition. By adding only a small distortion, they were able to turn any audio waveform into any target transcription with a 100% success rate, and they showed that audio adversarial examples have different properties from those on images. Another important contribution was made by [44], who embedded voice commands into songs (CommanderSong). They claimed that their approach was the first to generate attacks against DNN-based automatic speech recognition systems; the embedded commands were executed while the music was played over the air, without any human noticing. The study in [45] launched a practical and systematic adversarial attack against x-vectors, a DNN-based speaker recognition system. In their work, they added inconspicuous noise to the original audio. Interestingly, their attack fooled the speaker recognition system into making false predictions and even forced the audio to be recognized as any speaker the adversary desired.
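For intuition only, the sketch below outlines how an optimization-based targeted attack on a speaker embedding model might look. It assumes a differentiable embedding network embed_model and a target_embedding of the speaker the adversary wants to impersonate; these names and the procedure are an illustrative reconstruction, not the actual implementations of [43-45]:

import torch
import torch.nn.functional as F

def targeted_speaker_attack(embed_model, waveform, target_embedding,
                            epsilon=0.002, steps=100, lr=1e-3):
    # Search for a small additive perturbation that pulls the utterance's
    # embedding toward the target speaker's embedding.
    delta = torch.zeros_like(waveform, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        adv_embedding = embed_model(waveform + delta)
        # Maximize cosine similarity to the target embedding.
        loss = 1.0 - F.cosine_similarity(adv_embedding, target_embedding, dim=-1).mean()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)  # keep the added noise inconspicuous
    return (waveform + delta).detach()

If the cosine similarity between the adversarial embedding and the target embedding exceeds the system's verification threshold, the perturbed audio is accepted as the target speaker.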