A review on speaker recognition: Technology and challenges




Fig. 1. Types of biometrics: Physiological and Behavioral.

      1. Feature extraction

Feature extraction is a significant issue in the area of text-independent speaker recognition [5]. The basic principle of feature extraction is to extract a sequence of features for each short-time frame of the input signal, on the assumption that such a small segment of speech is sufficiently stationary to allow for better modeling [20]. In other words, the process retains useful and relevant information about the speech signal while rejecting redundant and irrelevant information [8,18]. This phase is vital for the next step, as it affects the behavior of the modeling process. In speaker recognition, the speech signal is analyzed to reduce variability and to identify more discriminative features by converting it into parametric values [21]. Various techniques can be used for extracting speech features in the form of coefficients, such as Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCCs) and Mel-Frequency Cepstral Coefficients (MFCCs) [8,9,17,22]. Table 3 presents a comparative summary of the merits and demerits of the various feature extraction techniques.
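As an illustration of the short-time analysis described above, the following sketch computes log mel filterbank energies, a common precursor to MFCCs (which add a discrete cosine transform over these energies). The 25 ms frames, 10 ms hop, and 26 filters are typical but illustrative choices rather than values prescribed here, and the input is a synthetic tone standing in for real speech:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(signal, sr, frame_len=400, hop=160, n_filters=26, n_fft=512):
    # Slice the signal into overlapping short-time frames (25 ms / 10 ms at 16 kHz),
    # assuming each frame is approximately stationary.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)           # taper frame edges
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # per-frame power spectrum

    # Build triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # One compact feature vector per frame; redundant detail is discarded.
    return np.log(power @ fbank.T + 1e-10)

sr = 16000
t = np.arange(sr) / sr                                # one second of synthetic "speech"
feats = log_mel_features(np.sin(2 * np.pi * 220 * t), sr)
```

Applying a DCT to each row of `feats` and keeping the first dozen or so coefficients would yield MFCC-style features.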

      2. Models and classifiers

Speaker models and classifiers are dependent not only on the features used but also on the task to be addressed. Many factors, such as type of speech, ease of training, and computational and storage requirements, need to be considered before choosing the modeling technique [29].
Modeling techniques are categorized into generative models and discriminative models [30]. Generative models capture the distribution of individual classes, while discriminative models learn the boundaries between classes. Fig. 3 shows the categorization of modeling techniques.

        1. Generative models. Generative models require training data samples from the target speaker and can take the form of a statistical or non-statistical model describing the target speaker’s feature distribution [29]. As shown in Fig. 3, generative models can be further classified into parametric and non-parametric models [31]. A model that assumes a structure characterized by certain parameters is known as a parametric model, whereas in a non-parametric model the probability density function is estimated with minimal assumptions [29].

          1. Parametric models. Parametric models include the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM). The benefit of these models is data efficiency: test data can be evaluated through statistical summaries of the training data rather than the data itself [32]. Their disadvantage is that the assumed structure is restrictive and may not be adequate to model the task [29].
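As a minimal illustration of the parametric approach, the sketch below fits a diagonal-covariance GMM with EM to toy two-cluster data standing in for one speaker's feature frames; the component count, initialization, and data are illustrative assumptions, not a prescribed setup:

```python
import numpy as np

def fit_diag_gmm(X, k=2, n_iter=50):
    """Fit a diagonal-covariance Gaussian mixture with EM (a minimal sketch)."""
    n, d = X.shape
    means = X[np.linspace(0, n - 1, k).astype(int)].copy()  # spread-out init
    var = np.tile(X.var(axis=0) + 1e-6, (k, 1))
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        logp = (np.log(w)
                - 0.5 * np.log(2 * np.pi * var).sum(axis=1)
                - 0.5 * (((X[:, None, :] - means) ** 2) / var).sum(axis=2))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means, and variances.
        nk = r.sum(axis=0) + 1e-10
        w = nk / n
        means = (r.T @ X) / nk[:, None]
        var = (r.T @ X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var

# Toy "speaker": frames drawn from two clusters, standing in for one
# speaker's feature distribution.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 4)), rng.normal(5.0, 1.0, (200, 4))])
w, means, var = fit_diag_gmm(X, k=2)
```

The fitted parameters (`w`, `means`, `var`) are the statistical summary mentioned above: a verification decision would score test frames by their likelihood under the claimed speaker's model rather than by comparison with the stored training data.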

          2. Non-parametric models. Template matching is a non-parametric approach in which a template is a sequence of feature vectors from a fixed phrase [14]. Non-parametric models include Dynamic Time Warping (DTW) and Vector Quantization (VQ). The advantage of these models is that they are free from assumptions about how the data are generated [8]. In addition, this approach requires no model training [29]. The disadvantage is that the pre-recorded templates are fixed, so variations in speech can only be modeled with many templates, which becomes impractical [33].
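To make the template-matching idea concrete, here is a minimal DTW sketch on illustrative synthetic sequences: it accumulates local frame distances along the cheapest monotone warping path, so a template aligned with the "same phrase" spoken at a different speed scores a smaller distance than one aligned with a different phrase:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j],            # skip a template frame
                                 D[i, j - 1],            # skip an input frame
                                 D[i - 1, j - 1])        # align the two frames
    return float(D[n, m])

# A template "phrase", the same phrase spoken more slowly, and a different one.
template = np.sin(2 * np.pi * np.linspace(0, 1, 50))[:, None]
same_slower = np.sin(2 * np.pi * np.linspace(0, 1, 80))[:, None]
different = np.cos(2 * np.pi * np.linspace(0, 1, 50))[:, None]
d_same = dtw_distance(template, same_slower)
d_diff = dtw_distance(template, different)
```

Note how the warping absorbs the tempo difference between the 50-frame and 80-frame renditions; what it cannot absorb, per the limitation above, is variation beyond what the fixed templates capture.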

        2. Discriminative models. These models require training data for both target and non-target speakers to obtain the optimal separation between the different speakers [29]. Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) are models in this category, also known as soft computing models. The advantage of these models is their flexible architecture and discriminative training power, while their disadvantage is that finding an optimal structure requires trial and error [29].
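A minimal discriminative sketch, assuming toy Gaussian "frames" for two speakers: logistic regression trained by gradient descent learns only the boundary between the two classes rather than each speaker's distribution (an ANN or SVM, as named above, would play the same role with more capacity):

```python
import numpy as np

def train_logreg(X, y, lr=0.1, n_iter=500):
    """Minimal discriminative classifier: logistic regression by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(speaker B | frame)
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of the log loss
        b -= lr * float(np.mean(p - y))
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# Toy frames for a target and a non-target speaker; both are needed for training,
# exactly as the text notes for discriminative models.
rng = np.random.default_rng(0)
Xa = rng.normal(-1.0, 0.7, (150, 3))     # speaker A frames
Xb = rng.normal(+1.0, 0.7, (150, 3))     # speaker B frames
X = np.vstack([Xa, Xb])
y = np.array([0] * 150 + [1] * 150)
w, b = train_logreg(X, y)
acc = float((predict(X, w, b) == y).mean())
```

The trial-and-error drawback mentioned above shows up even here as the hand-picked learning rate and iteration count; for ANNs it extends to the whole architecture.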

      3. Classifier choice

The choice of a classifier depends largely on the application and the constraints it imposes [34], as shown in Table 4.

    1. Speaker recognition and its application

Speaker recognition is employed in broad application areas, such as authentication, personalization, surveillance, and forensics, owing to its high public acceptance and accuracy rate, low-cost smart devices, and easy software installation. In the following subsections, some examples of the applications of speaker recognition are presented.
Table 1
Comparison of different biometric characteristics.

| Biometric type | Accuracy | Ease of use | User acceptance | Ease of implementation | Cost   |
|----------------|----------|-------------|-----------------|------------------------|--------|
| Voice          | Medium   | High        | High            | High                   | Low    |
| Face           | Low      | Low         | High            | Medium                 | Low    |
| Iris           | Medium   | Medium      | Medium          | Medium                 | High   |
| Fingerprint    | High     | Medium      | Low             | High                   | Medium |
| Retina         | High     | Low         | Low             | Low                    | Medium |
| Hand geometry  | Medium   | High        | Medium          | Medium                 | High   |
| Signature      | Medium   | Medium      | High            | Low                    | Medium |

Table 2
Speaker recognition vs. speech recognition.

| Feature     | Speaker recognition | Speech recognition |
|-------------|---------------------|--------------------|
| Recognition | Recognizes who is speaking by measuring voice pattern, speaking style, and other verbal traits. | Recognizes what is being said and converts it into text. |
| Purpose     | To identify the speaker. | To identify and digitally record what the speaker is saying. |
| Focus       | Biometric aspects of the speaker, such as pitch and intensity, to recognize him/her. | Vocabulary of what is being said by the speaker, turning the words into digital text. |
| Application | Voice biometrics. | Speech to text. |






