Channel conditions include the type of microphone used to record the speech, the way the speech is encoded/transmitted, and whether the speech contains noise or is free from noise. If the application must deal with a variety of channel conditions, the classifier could employ channel compensation to boost performance (one common technique is sketched below).
Amount of speech data available for enrolment and detection
If more data are available, the classifier has more information to work with, which helps it produce a better classification.
Available computational and memory resources
Embedded devices have limited amounts of processing power and available memory. A cell phone has very limited capabilities, which uniquely constrains the speaker recognizer.
The output of the system
The output of the system depends on the end-user. For forensic applications, the system must return word usage and phonotactic information. Furthermore, the type of output may need to be a hard decision, a human-interpretable score, or a relative score.
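One channel-compensation technique commonly used in this setting is cepstral mean normalization (CMN): a stationary recording channel appears as an additive bias in the cepstral domain, so subtracting the per-utterance mean of each coefficient removes much of its effect. The Python sketch below is only an illustration of this idea, under the assumption that features arrive as a frame-by-coefficient MFCC matrix; the function name is ours and is not taken from any cited system.

```python
import numpy as np

def cepstral_mean_normalization(features: np.ndarray) -> np.ndarray:
    """Apply per-utterance cepstral mean normalization.

    `features` is assumed to be a (num_frames, num_coefficients) array of
    MFCCs. A stationary convolutional channel adds a constant offset to each
    cepstral coefficient, so removing the utterance mean compensates for it.
    """
    return features - features.mean(axis=0, keepdims=True)
```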
Table 5
Types of speaker recognition.
Type of Speaker Recognition
Description
Speaker Identification
The task of speaker identification is to classify an unknown voice spoken anonymously as belonging to one of a set of N reference speakers [40].
Speaker Verification
The task of speaker verification is to decide whether the unknown voice belongs to a specific reference speaker, with two possible outcomes: to accept the reference speaker or to reject the impostor [40].
Speaker Detection
The task of speaker detection is to correctly mark the target speaker's speech when the target speaker is presented to the system together with the test speeches [39].
Speaker Segmentation
The task of speaker segmentation is to find the points where the speaker changes when an audio stream is presented to the system [41].
Speaker Clustering
The task of speaker clustering is to correctly cluster many utterances presented to the system; the task is usually done online [41].
Speaker Diarization
The task of speaker diarization is to automatically split the audio into speaker segments and determine which segments are uttered by the same speaker [41].
Surveillance is important mainly for security agencies collecting information, for example through electronic eavesdropping on telephone and radio conversations [21]. A filtering mechanism is needed to find the relevant information, such as recognizing the target speakers who are of interest to the service. Parole monitoring is another application, in which parolees are called at random times to verify that they are within their restricted area [35]. According to [31], speaker recognition is also used in applications for remote time-and-attendance logging and prison telephone usage.
Forensic
Speaker recognition can also be applied in forensics. This is possible when a speech sample was recorded during the crime, so that the suspect's voice can be compared against it. The result can help establish the identity of the perpetrator or exonerate the innocent in a court case.
Speech processing technology
Speech signal processing has become a widely used communication technology, as many applications use speech to enhance everyday human life. In digital signal processing, ASR is an essential tool for recognizing people based on their voice [2,4]. Human speech carries a great deal of information, as the voice is a vital characteristic of an individual [7,29]. Accent, language, speech content, emotion, gender, and the speaker's identity are some of the information contained in the human voice [2,36], as shown in Fig. 4.
The field of speaker recognition has gained more attention lately. Although researchers have been working on speaker recognition for the last eight decades, advancements in technology, such as the Internet of Things (IoT), smart devices, voice assistants, and smart homes, have made it popular [5]. Fig. 5 presents a detailed taxonomy of speech processing [37,38].
As illustrated in the figure, the domain of speech processing is divided into three major categories: analysis/synthesis, recognition, and coding. Recognition can be further divided into three parts: speech recognition, speaker recognition, and language recognition. Since the focus of this paper is on speaker recognition, the following subsections will discuss this in further detail.
Types of speaker recognition
Speaker recognition can be further classified into speaker identification, speaker verification, speaker detection, speaker segmentation, speaker clustering, and speaker diarization [39], as shown in Fig. 6. A brief description of these categories is given in Table 5 to help in understanding their differences better.
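To make the difference between identification and verification concrete, the following sketch assumes that each utterance has already been mapped to a fixed-length speaker embedding by whatever front-end is used, and scores embeddings with cosine similarity; the function names and the threshold value are illustrative assumptions, not the method of any system cited here.

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two speaker embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(test_emb: np.ndarray, enrolled: dict) -> str:
    # Identification: classify the unknown voice as one of N reference
    # speakers by returning the enrolled speaker with the highest score.
    scores = {spk: cosine_score(test_emb, emb) for spk, emb in enrolled.items()}
    return max(scores, key=scores.get)

def verify(test_emb: np.ndarray, claimed_emb: np.ndarray, threshold: float = 0.7) -> bool:
    # Verification: accept the claimed reference speaker only if the score
    # exceeds a decision threshold; otherwise reject the trial as an impostor.
    return cosine_score(test_emb, claimed_emb) >= threshold
```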
Research in the field of speaker recognition has recently increased. It can be further classified into a text-dependent or a text-independent system and an open-set or a closed-set system [8,15], as shown in Fig. 7.
In a text-dependent system, the same text is spoken during both the training phase and the testing phase, while in a text-independent system there is no constraint on the text being spoken, which makes it more convenient for the speakers [2,4,8]. Thus, the text-dependent system's training process is much faster, since it has a fixed set of inputs to validate; its limitation is the inconvenience to the speaker of having to utter the same words each time. In contrast, the text-independent system's training phase is more prolonged, as the model does not consider what is being spoken but instead converts the audio to feature vectors to identify the speaker correctly [5,29].
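As a rough illustration of the text-independent pipeline described above, the sketch below (assuming the librosa library is available) extracts frame-level MFCCs without any reference to the words spoken and pools them into a single utterance-level vector; pooling by mean and standard deviation is a simplification chosen here for brevity, not the approach of the cited works.

```python
import numpy as np
import librosa

def utterance_vector(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    # Load the waveform and compute frame-level MFCCs; nothing here depends on
    # the text that was spoken, which is what makes the system text-independent.
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    # Pool the variable number of frames into one fixed-length vector.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```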
An open-set system does not restrict the test speakers to the enrolled population, so the test voice may come from speakers other than the trained ones. In a closed-set system, however, the unknown voice must come from the set of known speakers [8,15,29].
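The practical difference between the two settings shows up in the decision rule, as in this hedged sketch (reusing the illustrative score dictionaries from the earlier example): a closed-set system always returns the best-scoring enrolled speaker, while an open-set system additionally requires that score to clear a rejection threshold, whose value here is purely an assumption.

```python
def closed_set_decision(scores: dict) -> str:
    # Closed set: the unknown voice must belong to one of the known speakers,
    # so the best-scoring enrolled speaker is always returned.
    return max(scores, key=scores.get)

def open_set_decision(scores: dict, reject_threshold: float = 0.6) -> str:
    # Open set: the test voice may come from outside the enrolled set, so the
    # best score must also exceed a threshold or the speaker is marked unknown.
    best = max(scores, key=scores.get)
    return best if scores[best] >= reject_threshold else "unknown"
```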
Speaker recognition is not an easy task, as many factors create variance in the speech signals between training and testing sessions, such as changes in people's voices over time, health conditions, speaking rates, etc. [16].
Issues in speaker recognition
Although human voice recognition may seem easy, such as recognizing a person's voice on the phone, implementing speaker recognition systems is challenging. Among the most challenging issues in building reliable speaker recognition systems are variability and insufficient data [7,17]. Besides these two problems, [2] and [14] further identified background noise as an issue that can also affect speaker recognition performance. The problems may be related either to the speaker or to technical errors [12]. In the following subsections, a detailed discussion of each case is presented.