Speaker Recognition - Identifying People by Their Voices
George R. Doddington, Member, IEEE




[Figure 1: plot against population size (horizontal axis ticks 2, 5, 10, 20, 50, 100); one curve is labeled "optimum identification".]
Fig. 1. Relative performance of speaker verification and speaker identification for a hypothetical task in which the feature vector for speaker i is distributed as a multivariate Gaussian with mean vector u(i) and an identity covariance matrix, and in which the speakers are chosen randomly so that the mean vector for speaker i is itself drawn from a multivariate Gaussian distribution with zero mean and a covariance matrix equal to a constant S times the identity matrix [19]. For the identification task, the decision is to choose the speaker for which the likelihood of the observed feature vector is maximum. For the verification task, three decision rules are considered. First, the optimum decision is to choose the hypothesis for which the likelihood of the observed feature vector is greatest, given knowledge of the impostor distributions. Second, since it is unlikely that there will be knowledge of the impostors in any real application, the distribution of the impostor population is estimated for indefinitely large impostor populations. Finally, the easiest and often a completely satisfactory approach is to assume that the probability distribution for impostors is extremely diffuse and therefore to assume a constant likelihood function for impostors. The results of a computer simulation of this example are plotted for three values of speech feature vector dimensionality, namely m = 1, 2, and 4. The variance of the impostor mean features, S, was chosen to provide a suboptimum verification error rate of 2.5 percent (S = 2557 for m = 1, S = 72.4 for m = 2, and S = 11.0 for m = 4). The various recognition hypotheses were assigned equal a priori likelihood, and the error rates shown were obtained by averaging over a very large number of randomly chosen populations.
Note that optimum verification always yields better error rates than optimum identification, except when the population size is 2, in which case verification and identification are the same task. The most important result illustrated by this example is that verification performance, even for suboptimum decision rules, remains satisfactory as the population size becomes large, whereas the performance for speaker identification continues to degrade, with the identification error rate approaching 100 percent for sufficiently large populations.
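The experiment described in the caption can be approximated with a small Monte Carlo sketch. This is not the original simulation: the function names, the trial counts, and the verification threshold are assumptions chosen for illustration, and only the constant-impostor-likelihood ("suboptimum") verification rule is implemented. The sketch assumes speaker means drawn from N(0, S·I) and features drawn from N(mean, I), as in the caption.

```python
import math
import random


def draw_vector(rng, mean, std):
    """Draw one Gaussian vector with independent components."""
    return [rng.gauss(m, std) for m in mean]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def identification_error(n_speakers, dim, s, trials, rng):
    """Closed-set identification error, averaged over random populations."""
    zero = [0.0] * dim
    errors = 0
    for _ in range(trials):
        # A fresh population per trial: speaker means ~ N(0, S*I).
        means = [draw_vector(rng, zero, math.sqrt(s)) for _ in range(n_speakers)]
        true_id = rng.randrange(n_speakers)  # equal a priori likelihood
        x = draw_vector(rng, means[true_id], 1.0)  # features ~ N(mean, I)
        # With identity covariance, maximum likelihood is minimum distance.
        guess = min(range(n_speakers), key=lambda i: sq_dist(x, means[i]))
        errors += guess != true_id
    return errors / trials


def verification_error(dim, s, threshold, trials, rng):
    """Verification error under a constant impostor likelihood.

    The claim is accepted when the squared distance to the claimed
    speaker's mean is below `threshold` (an assumed tuning parameter).
    """
    zero = [0.0] * dim
    false_reject = 0
    false_accept = 0
    for _ in range(trials):
        claimed = draw_vector(rng, zero, math.sqrt(s))
        impostor = draw_vector(rng, zero, math.sqrt(s))
        x_true = draw_vector(rng, claimed, 1.0)
        x_imp = draw_vector(rng, impostor, 1.0)
        false_reject += sq_dist(x_true, claimed) >= threshold
        false_accept += sq_dist(x_imp, claimed) < threshold
    # Equal priors on "true speaker" and "impostor".
    return 0.5 * (false_reject + false_accept) / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (2, 10, 50):
        print(n, identification_error(n, dim=1, s=2557.0, trials=2000, rng=rng))
    print("verification", verification_error(1, 2557.0, 9.0, 2000, rng))
```

Note that `verification_error` takes no population-size argument: under this rule the decision depends only on the claimed speaker's reference, which is why the verification curve stays flat while the identification error grows with the number of speakers.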
for optimum decision rules applied to a multivariate Gaussian model. In fact, population size is a critical performance parameter for speaker identification, with the probability of error approaching 1 for indefinitely large populations. The performance for speaker verification, however, is unaffected by population size. Although verification performance is stable with increasing speaker population size, there is one difficulty with verification that is not present with identification: a much more comprehensive grasp of the variability of the speech features used for discrimination is required in the verification task. Thus, while determining which reference is "closest" to the input token may serve the identification task reasonably well without statistical calibration, the verification task demands proper statistical characterization of verification features in order to judge "close enough."
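The contrast between "closest" and "close enough" can be written out for the Gaussian model of Fig. 1. The following is a sketch under that model's assumptions (identity covariance, feature dimensionality m); the symbols x for the observed feature vector, k for the claimed speaker, c for the assumed constant impostor likelihood, and T for the resulting threshold are notation introduced here, not taken from the paper.

```latex
% Identification: only the ranking of distances matters.
\hat{\imath} \;=\; \arg\max_{i}\; p(x \mid i)
            \;=\; \arg\min_{i}\; \lVert x - u(i) \rVert^{2},
\qquad
p(x \mid i) \;=\; (2\pi)^{-m/2} \exp\!\Bigl(-\tfrac{1}{2}\lVert x - u(i) \rVert^{2}\Bigr).

% Verification with a constant impostor likelihood c and equal priors:
% accept the claim k iff p(x | k) > c, i.e. iff
\lVert x - u(k) \rVert^{2} \;<\; T,
\qquad
T \;=\; -2\ln c \;-\; m \ln(2\pi).
```

The identification rule is unchanged if every distance is rescaled by the same factor, whereas the verification threshold T has meaning only when the distances are expressed in properly calibrated statistical units.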
Speaker identification as defined above is also sometimes called "closed-set" identification, which distinguishes it from "open-set" identification. In open-set identification the possibility exists that the unknown voice token does not belong to any of the reference speakers. The number of possible decisions is then N + 1, which includes the option to declare that the unknown token belongs to none of the reference speakers. Thus open-set identification is a combination of the identification and verification tasks which combines the worst of both: performance is degraded by the complexity of the identification task, and the rejection option requires good characterization of speech feature statistics.
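The N + 1 decision structure can be illustrated with a minimal sketch. It assumes squared Euclidean distance as a stand-in for the likelihood (appropriate only under an identity-covariance Gaussian model); the function name `open_set_identify` and the parameter `reject_threshold` are hypothetical and chosen for illustration.

```python
def open_set_identify(x, references, reject_threshold):
    """Return the index of the nearest reference speaker, or None.

    None is the (N + 1)-th decision: the token is judged to belong to
    none of the N reference speakers.
    """
    best_index = None
    best_dist = float("inf")
    # Identification step: find the closest reference.
    for i, ref in enumerate(references):
        d = sum((a - b) ** 2 for a, b in zip(x, ref))
        if d < best_dist:
            best_index, best_dist = i, d
    # Verification step: is the closest reference close enough?
    if best_dist > reject_threshold:
        return None
    return best_index


if __name__ == "__main__":
    refs = [[0.0, 0.0], [10.0, 0.0]]
    print(open_set_identify([0.5, 0.0], refs, 4.0))
    print(open_set_identify([5.0, 5.0], refs, 4.0))
```

The first step inherits the population-size sensitivity of identification, and the second depends on a threshold that is only meaningful when the feature statistics are well characterized, which is the sense in which the open-set task combines the difficulties of both.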
The degree of control over the generation of the speech token is another important speaker recognition parameter. Fixed-text or text-dependent speech tokens are used in applications in which the unknown speaker wishes to be recognized and is therefore cooperative. Free-text or text-independent speech tokens are required in those applications where such control cannot be maintained, either because the speaker is not cooperative or perhaps because the recognition must be done unobtrusively. Generally speaking, recognition performance in fixed-text applications will be better than in free-text applications, because better calibration of the input speech token is possible with identical reference speech material and because the ability to control the text of the input speech will often extend also to the control of the speaker and his environment. Serving to offset this somewhat, free-text applications often provide longer samples of speech for recognition, which tends to improve recognition performance.
Armed with these task definitions we can now tabulate the recognition tasks required by the various applications. This is done in Table 1. Of these applications, security applications will exhibit the best speaker recognition performance because of the cooperative user and controlled conditions. A most difficult problem in the forensic and reconnaissance applications is to establish a valid statistical model upon which verification decisions may be based. An effective statistical model is difficult to establish in this environment because of the lack of control over the speech signal and the speaker and also because of the difficulty in predicting acoustical and transmission conditions. Even with an adequate statistical model for making verification decisions, the lack of control in text-independent applications invariably results in much poorer recognition performance than for those applications in which control is exercised over all aspects of the speaker verification task.



[Table 1 (body not recovered). Surviving headings: IDENTIFICATION, VERIFICATION; open-set, closed-set; free-text, fixed-text.]




