ABSTRACT Voice is a behavioral biometric that conveys information related to a person’s traits, such as ethnicity, age, gender, and emotional state. Speaker recognition deals with recognizing the identity of people based on their voices. Although researchers have been working on speaker recognition for the last eight decades, advances in technology, such as the Internet of Things (IoT), smart devices, voice assistants, smart homes, and humanoid robots, have made its use widespread today. This paper provides a comprehensive review of the literature on speaker recognition. It discusses the advances made in the last decade, including the challenges in this area of research. It also describes the structure of a speaker recognition system, together with its feature extraction methods and classifiers, and surveys the applications in which speaker recognition is used. Because recent studies have shown that machine learning models can be fooled into giving incorrect predictions, adversarial attacks are also discussed. The aim is to enhance researchers’ understanding of the area of speaker recognition.
Introduction
The growing interest in security has seen a rise in the use of biometrics. Besides the face, other unique features, such as the retina, voice, and iris, can also be used to distinguish between people. As shown in Fig. 1, biometrics can be classified into two categories: physiological and behavioral [1,2]. The former includes the face, fingerprint, and iris, while the latter includes voice, keystroke, and signature.
Table 1 provides the typical characteristics of biometric technologies and their performance in terms of accuracy, ease of use, user acceptance, ease of implementation, and cost [3,4]. Based on the information in the table, it can be deduced that voice is one of the most useful technologies, as it is easy to use, easy to implement, low in cost, and widely accepted by users. Furthermore, the study in [5] asserted that, besides the iris, fingerprint, and face, voice is another useful biometric because it provides a comparable level of security. Meanwhile, the study in [1] stated that voice can be used to differentiate people because each person’s voice has unique characteristics.
In general, any sound produced by humans to communicate meanings, ideas, opinions, and so on is called voice. More specifically, voice is defined as any sound produced by vocal fold vibration, which occurs when air from the lungs is pushed through the folds under pressure [6]. Voice is the most natural communication tool used by humans, and it conveys the speaker’s traits, such as ethnicity, age, gender, and emotional state.
Research in speaker recognition has been conducted worldwide over the last eighty years, but it has intensified significantly due to advances in signal processing, algorithms, architectures, and hardware [7]. Automatic Speaker Recognition (ASR) is a digital signal processing field concerned with recognizing people from their voices. Every individual’s voice is unique due to differences in the shape of the vocal tract, the size of the larynx, and other parts of the human voice production organs [8,9]. The features of a voice depend on its pace, volume, pitch, and quality, while the articulation rate and speech pauses depend on the speaker’s speaking style [8]. Although several review papers have been published in the field of ASR, each covers a different perspective. This paper therefore aims to provide a thorough review of the ASR system and its latest issues and challenges.
This paper is organized as follows: Section 2 explains the difference between the terms “speaker recognition” and “speech recognition” and describes the structure of speaker recognition, including its feature extraction methods, classifiers, and models. Section 3 addresses the taxonomy of speech processing technology, covering the types of speaker recognition and its open issues, including adversarial attacks. Section 4 presents the milestones in speaker recognition over the last decade, organized by the feature extraction methods used. The last section concludes the paper.
Speaker recognition
In speech processing, speaker recognition and speech recognition are the two applications commonly used by researchers to analyze uttered speech [10]. Before delving further into the structure of speaker recognition, it is vital to understand the difference between the two. Speech recognition is concerned with the words being spoken, while speaker (or voice) recognition aims to recognize the speaker rather than the words [11].
Speaker recognition vs. speech recognition
Speech recognition is useful for people with various disabilities, such as those with physical disabilities who find typing difficult, painful, or impossible, and those who have difficulty recognizing and spelling words, such as people with dyslexia [12]. Since speech recognition deals with converting audio into text, its effectiveness depends heavily on the language and the text corpus [5]. On the other hand, speaker recognition aims to identify the person who is speaking. Pitch, speaking style, and accent are some of the features that distinguish one speaker from another [11]. Speaker recognition technology has been used in various applications, such as biometrics, security, and even human-computer interaction. Table 2 summarizes the differences between speaker recognition and speech recognition in terms of recognition, purpose, focus, and application [13].
Advances in various fields have increased the importance of speaker recognition systems, especially for establishing a person’s identity.
Structure of speaker recognition
Speaker recognition involves finding the identity of an unknown speaker by comparing his or her voice with those available in the database; it is a one-to-many comparison [14]. The basic framework and components of speaker identification, as shown in Fig. 2, consist of two phases:
enrolment, also known as training, and recognition, also known as testing [15,16]. The following subsections present the main phases involved in the framework.
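To make the two phases concrete, the listing below sketches a toy speaker identification pipeline in Python/NumPy. The feature extractor, the averaged-template speaker model, and the cosine-similarity scoring are illustrative placeholders chosen for brevity; they are not methods prescribed by the cited works, and a practical system would instead use features and classifiers such as those discussed in the following sections.

import numpy as np

def extract_features(signal, n_fft=512):
    """Placeholder feature extractor: log-magnitude spectrum of the first n_fft samples."""
    spectrum = np.abs(np.fft.rfft(signal, n=n_fft))
    return np.log(spectrum + 1e-8)

def enrol(speakers):
    """Enrolment (training) phase: build one reference template per known speaker."""
    return {name: np.mean([extract_features(u) for u in utterances], axis=0)
            for name, utterances in speakers.items()}

def recognise(models, test_signal):
    """Recognition (testing) phase: one-to-many comparison against all enrolled speakers."""
    test_vec = extract_features(test_signal)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scores = {name: cosine(template, test_vec) for name, template in models.items()}
    return max(scores, key=scores.get), scores

# Usage with synthetic signals standing in for recorded utterances.
rng = np.random.default_rng(0)
enrolment_data = {"speaker_A": [rng.standard_normal(16000) for _ in range(3)],
                  "speaker_B": [rng.standard_normal(16000) for _ in range(3)]}
models = enrol(enrolment_data)
predicted, scores = recognise(models, enrolment_data["speaker_A"][0])

In practice, the per-speaker templates would be replaced by trained statistical or neural models, and the raw signals would first pass through the pre-processing and feature extraction steps described below.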
Pre-processing
Pre-processing is the first step in speech signal processing, and it involves converting an analogue signal into a digital signal [9,17]. Interference due to noise often occurs during speech recording, causing performance to degrade. Pre-processing is therefore a critical step, as improper pre-processing of the recorded speech input will decrease the classification performance [17]. The main objective of the pre-processing stage is to make the speech signal suitable for feature extraction analysis [16,18]. Different noise-reduction algorithms can be adopted, the two most frequently used being spectral subtraction and adaptive noise cancelation [18]. However, [19] highlighted that the functions to be used during the pre-processing stage depend very much on the approach employed at the feature extraction stage. Some of the commonly used functions include noise removal, endpoint detection, pre-emphasis, framing, and normalization [16,17,19].
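As a rough illustration of how these functions fit together, the following Python/NumPy sketch applies amplitude normalization, pre-emphasis, framing, and a simple energy-based endpoint detector to a raw signal. The parameter values (a 0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop, and a fixed energy threshold) are common defaults assumed for the example rather than values taken from the cited references.

import numpy as np

def normalise(signal):
    """Scale the signal so its peak amplitude lies in [-1, 1]."""
    return signal / (np.max(np.abs(signal)) + 1e-12)

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n - 1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Split the signal into overlapping frames for short-time analysis."""
    frame_len, hop_len = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

def endpoint_detect(frames, threshold=0.01):
    """Keep only frames whose short-time energy exceeds a fixed threshold."""
    energy = np.mean(frames ** 2, axis=1)
    return frames[energy > threshold]

# Typical usage on a raw one-second recording (synthetic noise here).
raw = np.random.default_rng(1).standard_normal(16000)
speech_frames = endpoint_detect(frame(pre_emphasis(normalise(raw))))

Noise-reduction methods such as spectral subtraction would be applied in the same chain, typically on the framed signal before feature extraction.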
Fig. 1. Classification of biometrics into physiological (face, fingerprint, hand, iris, DNA) and behavioral (voice, keystroke, signature) modalities.